Summary (translated from the Dutch)


In this doctoral dissertation an abstract geometric formalism is developed
for modelling data of a very general nature. This mathematical framework
has been named the "data set model formalism" and is inspired by
information geometry. The modelling, or fitting, is carried out by means of
a divergence function: a generalised distance measure of which the relative
entropy is probably the best-known example. By requiring that these models
form a differentiable manifold, they can be equipped with a geometric
structure that follows from the divergence function. The importance of this
structure is that it allows the most important properties of the modelling
process to be described quantitatively. Central to this is the so-called
Hessian structure, which makes it possible to derive the Riemannian metric
of the model manifold from a family of scalar functions. This requires the
choice of a flat affine connection, for which a construction from the
divergence function is also given.

The data set model formalism offers a number of advantages over existing
modelling techniques. The most important of these is its great mathematical
flexibility, which is due to the use of differential geometry. From a
theoretical point of view, the possibility of bringing a broad range of
different modelling problems into a single framework is especially
interesting. The formalism thus encompasses information geometry but also
regression models and elements of quantum statistics. This work is also of
interest for practical applications. The reason is that the techniques
developed allow given data to be modelled by geometric methods, even when
the models differ qualitatively from the data. This opens perspectives for
applications within the field of machine learning and for the development
of an extension of information theory to quantum systems.

Since the formalism constitutes a direct generalisation of information
geometry, its use can also be of benefit in that discipline. The most
important innovation here is that the new formalism allows the full space
of probability distributions over a set to be taken as the model. In
information geometry this cannot be done in a meaningful way, since there
the probability measures necessarily play the role of data. Using this set
as the model as well would consequently make the modelling process
redundant. In addition, this thesis derives a simple technique for
answering the question of whether a statistical model belongs to the
exponential family. The simplicity of this procedure suggests that it has
already been discovered by other researchers. Nevertheless, all references
to it are missing from the reference works on information geometry. This
contrasts with the practical use it offers, so that an exposition in this
work seems justified.

The dissertation begins with an exposition of the basic ideas of
differential geometry. The concepts discussed there form the mathematical
framework on which the rest of this text is built. A short historical
overview of the most important developments in information geometry has
also been included. Together these two parts form the introduction
(Chapters 2 and 3). The state of the actual research is described in
Chapter 4, where the data set model formalism is treated in detail. The
geometric structure is also defined there and its relation to the modelling
process is studied. The fifth and last proper chapter is devoted to a
number of applications, which are mainly illustrative in nature.
1. OPENING CHAPTER


1.1 Situating the dissertation 

A frequently recurring class of problems in science takes the form of
describing data by a simplified, parametrised model. Possibly the
best-known approach in physics to this kind of modelling is the method of
maximum entropy nm. Its most common use is probably in the context of the
thermodynamic canonical ensemble. To apply this method one defines a
suitable entropy function and one or more measurable quantities (often
called Hamiltonians or extensive variables) on the space of all probability
distributions over the possible states of the system. Those distributions
which predict the same values for each of the Hamiltonians are grouped
together in subsets, each of which is represented by its element exhibiting
the highest entropy: the so-called model distributions. Identifying the
model distribution within such a subset is an optimisation problem with
constraints, and the Lagrange multipliers appearing there also serve as
labels for these distributions. The task of modelling the experimental data
is thus reduced to determining the values of the parameters consistent with
the measured values of the extensive variables.
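
As a rough illustration of the procedure just described, the following
sketch treats a hypothetical three-level system with a single Hamiltonian
(the energy). It assumes the Shannon entropy, for which the constrained
maximisation yields distributions proportional to exp(-beta * E), and it
determines the Lagrange multiplier beta from a hypothetical measured mean
energy by bisection. The energy levels, the measured value and all function
names are assumptions made for this example only.

```python
import math

def boltzmann(energies, beta):
    """Maximum-entropy distribution at fixed mean energy: p_i ~ exp(-beta * E_i)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)  # normalisation constant (partition function)
    return [w / z for w in weights]

def mean_energy(energies, beta):
    """Expectation value of the energy under the model distribution."""
    p = boltzmann(energies, beta)
    return sum(pi * e for pi, e in zip(p, energies))

def solve_beta(energies, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Find the multiplier beta whose model distribution reproduces the target.

    Bisection is sufficient because the mean energy decreases
    monotonically as beta increases.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, mid) > target:
            lo = mid  # mean still too high, so beta must be larger
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

energies = [0.0, 1.0, 2.0]  # hypothetical energy levels
measured = 0.6              # hypothetical measured mean energy
beta = solve_beta(energies, measured)
model = boltzmann(energies, beta)
```

The single number beta labels the model distribution, which is the sense in
which the Lagrange multipliers serve as parameters of the model.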

While this dissertation does not contain a criticism of the maximum entropy
method, a very different and less widely known approach to modelling is
studied herein. More specifically, this work is concerned with the
development of a framework meant to include a broad range of modelling
problems and which is based entirely on geometric foundations: the data set
model formalism. The geometry is derived from general divergence functions
which quantify how well data is described by a given element of the model.
An example of such a divergence function, found in statistics and
statistical physics, is the relative entropy or Kullback-Leibler
divergence M-
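
To make the role of a divergence function concrete, the sketch below
computes the relative entropy between a hypothetical empirical distribution
and the elements of a hypothetical one-parameter model (a biased coin), and
selects the model element with the smallest divergence. The data, the model
family and the grid search are illustrative assumptions, not a method taken
from this dissertation.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) between two discrete distributions.

    Terms with p_i = 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bernoulli(theta):
    """Hypothetical one-parameter model: a coin with P(heads) = theta."""
    return [theta, 1.0 - theta]

data = [0.7, 0.3]  # hypothetical observed frequencies (heads, tails)
candidates = [k / 100 for k in range(1, 100)]  # grid of parameter values
best_theta = min(candidates, key=lambda t: kl_divergence(data, bernoulli(t)))
```

On this grid the minimum is attained at the parameter value that reproduces
the observed frequency of heads.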

The motivation to develop such an abstract geometrical theory comes from
a criticism made against the mainstream formulation of information theory.
This discipline is important to modelling problems as it is intimately
related to the question of how to make meaningful statements regarding the
quality of the modelling procedure. In particular, the fact that
information theory is based on probability theory has received critical
attention from some authors. This school of thought appears to have started
in the work of Ingarden and Urbanik. They proposed an approach to
information theory based on Boolean rings and explained their motivation
for doing so as [6]

... information seems intuitively a much simpler and more elementary
notion than that of probability ... [it] represents a more
primary step of knowledge than that of cognition of probability.

Another proponent of this idea is none other than Andrej Kolmogorov, who
asserted [7]

Information theory must precede probability theory, and not be 
based on it. 

It must be remarked that the paper in which this statement can be found is
devoted to arguing, amongst other things, that the basis of information
theory must be combinatorial in nature. While a geometric theory does not
have a close relation to combinatorics, the radically different viewpoint
it offers, especially when sufficiently abstract, could present new and
useful insights into the essence of information even if Kolmogorov should
be definitively vindicated.

Such insights into information without probability might also prove to be
valuable in physics. An outspoken advocate of the importance of information
theory there was John Archibald Wheeler. He stated that [8]

All things physical are information theoretic in origin ... and
information gives rise to physics.

One author to have made an attempt, albeit a controversial one, to express
Wheeler's idea is Roy Frieden, who tries to base all of physics on an
information-theoretical foundation [9]. However, it is especially in
quantum theory that Wheeler's viewpoint has found many adherents. This is
testified by the numerous investigations into the question of whether or
not it is possible to frame quantum theory in purely information-theoretic
terms, see for example [IQHI 3 . The characterisation of information used
by these researchers needs to take into account the laws governing
microscopic physics, including the laws of probability. Due to the presence
of incompatible observables, however, quantum theory requires the notion of
conditional probability to be discarded. This implies that quantum
information theory must be appreciably different from its classical
counterpart, despite being drafted in similar mathematical terms [15, 16].
One potential solution to this problem could lie precisely in the
development of a sufficiently abstract formalism, such as the one
introduced here, simultaneously generalising both theories of information.

Another argument from quantum physics is related to recent experimental
advances in that field, such as those for which Serge Haroche and David
Wineland were awarded the 2012 Nobel Prize in Physics. The possibility of
performing so-called weak measurements [17, 18] has received much
attention, not only in the experimental physics community but also from
researchers interested in the foundations of quantum theory [19, 20] and in
quantum information theory [21]. In such a measurement, information about
the state of the system can be obtained without collapsing the wave
function. Recent work even claims successful direct observations of the
wave function itself through a combination of weak measurements and
ordinary projective measurements [22, 23], a feat considered impossible by
the Copenhagen interpretation of quantum mechanics [23]. An adaptation of
quantum information theory in order to accommodate these findings may
therefore be required. The availability of an abstract and thereby flexible
framework of information would likely be regarded as a boon for those
researchers working out the details of such a transition.

Even if no changes to quantum information theory turn out to be needed, the
contents of this dissertation may still prove useful to that field of
study. Newly obtained results (at the time of writing still unpublished)
making use of the data set model formalism indicate that it may be possible
to simultaneously simplify and generalise existing results such as those of
Petz on positive-operator valued measures [16].

The data set model formalism is for a large part a generalisation of the
results of information geometry. This discipline, which is introduced more
elaborately in a later chapter, is a differential geometric framework for
probability theory and statistical models. The generalisation performed in
this research consists mainly in removing the limitation to these topics,
leading to a very general picture of the modelling process. While such an
approach could have the disadvantage of leaving an impression of
abstraction and technicality on the reader, it is also believed to be
advantageous in the long term. Indeed, it is precisely by exercising this
increased level of abstraction that it is hoped that the geometric essence
of information will be laid bare without referring to any context-specific
properties.

Some additionally obtained results may be of use to researchers interested
in statistics. In particular, the construction of the data set model
geometry offers a convenient method to establish whether or not a
parametrised family of probability distributions belongs to the exponential
family. In this case, the geometric structure also facilitates the search
for the canonical parameters of the family. While the simplicity of this
method suggests that it is not an original finding, reference thereto seems
to be missing from the information geometry literature.

Furthermore, a varied range of potential applications is expected in the
longer term. This supposition is based on a number of advantageous
innovations. Perhaps the most prominent of these is that the parametrised
family of models is no longer required to be a subset of the data. Models
are even allowed to be mathematical objects qualitatively different from
those data. This is to be contrasted with both the maximum entropy method
and information geometry. For this reason, a possibly less obvious but
still rather promising field for applications is that of machine learning.
This is a very broad area of ongoing research related to artificial
intelligence, data mining and other methods for information processing
which have gained dramatically in importance over the last few years. A
list of example problems includes curve fitting and the estimation of
probability densities, but also applications less (or less obviously)
related to parameter estimation, such as fingerprint recognition, Google's
PageRank algorithm, automated translation and many more. (See for example
[25, 26] for an introduction to this very rich discipline.)

1.2 Structure of the dissertation 

This opening chapter has summarised the results of the research upon which 
this dissertation reports and it has sketched the broader context in which 
these are to be seen. For clarity, a brief explanation of the conventions and 
notations used in the rest of the text follows shortly hereafter. 

The following two chapters form the introduction. Chapter 2 gives a short
overview of the basic concepts of differential geometry which are part of
this dissertation's lexicon: manifolds, vectors, differential forms, the
Riemannian metric, as well as affine connections and their curvature. These
notions provide the mathematical language in which the rest of the
dissertation is framed. Chapter 3 introduces information geometry along the
lines of its historical development. This chapter also contains a brief
introduction to divergence functions as they appear in the literature and a
short overview of applications of differential geometry in thermodynamics.
The main chapter of this dissertation is the fourth, in which the
theoretical aspects of the data set model formalism are elucidated. The
basic elements and assumptions are explained, a geometrical structure for
the models is constructed and the essential properties of this structure
are studied. Chapters 2-4 share a similar internal structure so as to make
the analogies between them clearer. Selected examples and applications of
the formalism are found in the fifth chapter for the purpose of
illustration. The dissertation ends with the conclusion and an outlook on
possible future research in this context, followed by a curriculum vitae of
the author and the bibliography.

1.3 Conventions and notations 

There are some conventions and notations which are used throughout this 
dissertation. A fairly large part of these are expected to be already known 
or intuitively clear to the reader. Nevertheless, a short summary of these is 
presented here. 

As this dissertation makes extensive use of differential geometry, one of
the most useful conventions is that of Einstein summation. This convention
states that when an index appears twice within the same term of an
equation, this index is implied to be summed over all of its values. In
such pairs, the index will always appear once as an upper index and once as
a lower index. Unpaired indices must necessarily appear in every term of an
equation. It is conventional for this to mean that the equation holds for
all possible values of the unpaired index. For example, it is possible to
write down the affine transformation of the vector $v$ into $w$ by the
matrix $A$ and the vector $b$ in terms of components as

    \[ w^i = A^i{}_j v^j + b^i \]

rather than the equivalent traditional, but much more cumbersome, notation

    \[ w^i = \sum_{j=1}^{n} A^i{}_j v^j + b^i \qquad \forall i \in \{1 \ldots n\}. \]
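
For readers who prefer to see the convention spelled out operationally, the
following minimal sketch evaluates the affine transformation above for a
hypothetical two-dimensional example. The repeated index j becomes an
explicit sum, while the free index i labels the entries of the result. The
numerical values are arbitrary.

```python
def affine(A, v, b):
    """Return w with w^i = A^i_j v^j + b^i.

    The repeated index j is summed over; the free index i runs over
    the components of the result.
    """
    n = len(v)
    return [sum(A[i][j] * v[j] for j in range(n)) + b[i] for i in range(n)]

A = [[2.0, 0.0],
     [1.0, 3.0]]   # hypothetical matrix
v = [1.0, 2.0]     # hypothetical vector
b = [0.5, -1.0]    # hypothetical offset
w = affine(A, v, b)
```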

The convention for unpaired indices (the expression in which they appear
holds for all values the index may take) is extended to all quantities and
the symbols representing them. Whenever an expression makes use of a symbol
which is not specified uniquely, the expression is assumed to hold for
every possible concrete object the symbol could represent, given any
restrictions that may apply. This may sound like a complicated and
technical way of introducing quantities. Should the reader be left with
this feeling, he or she is reminded that this same convention has already
been invoked above when introducing the matrix $A$ and the vectors $v$ and
$b$. A similar convention is well known from expressions for functions, but
there it is traditionally obscured by speaking about "variables". However,
the same result is obtained by agreeing that, for example, the expression

    \[ f(x) = x \ln(x) \]

holds for all possible real numbers $x$ in the domain of the function $f$.
This convention is useful to avoid what can sometimes become long
enumerations of definitions mostly obvious from context, as these have a
habit of making mathematical texts more tedious and less pleasant to read
than really necessary.

Another situation where a more concise notation is used than the one the
reader may be familiar with is that of partial derivatives with respect to
parameters or coordinates. From Chapter 4 onwards, very general divergence
functions, denoted by the letter $D$, will be used. These are functions
taking a pair $(x, m_\theta)$ of arguments and mapping the pair to a real
number $D(x \| m_\theta)$. Since the second argument will be an element of
a manifold and can be endowed with coordinates, it is possible to
differentiate these functions with respect to the coordinates. The
traditional notation for such a derivative, evaluated in the point with
coordinates $\theta$, would look like

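    \[ \left. \frac{\partial}{\partial \vartheta^k} D(x \| m_\vartheta) \right|_{\vartheta = \theta}. \]

Since such expressions will appear frequently, it is preferable to use a
more concise notation. Therefore, the quantity above will also be
represented by the expression resembling the one for functions of the
parameters only,

    \[ \partial_k D(x \| m_\theta). \]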

For readers who find this notation confusing, it may be useful to keep in
mind the analogy with the commonly used expression $f'(a)$, which denotes
evaluation of the derivative of a function $f$ in the point $a$, avoiding
the use of an auxiliary variable. The traditional notation will still be
used where ambiguity could arise from using the shorter alternative.
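
The meaning of the concise notation can be illustrated numerically. The
sketch below assumes a hypothetical divergence (the relative entropy
between an observed frequency x of heads and a coin model with a single
coordinate theta) and estimates its derivative with respect to the model
coordinate by a central finite difference, with x held fixed. All names and
values are assumptions for this example.

```python
import math

def divergence(x, theta):
    """Hypothetical D(x || m_theta): relative entropy between an observed
    frequency x of heads and a coin model with P(heads) = theta."""
    return x * math.log(x / theta) + (1 - x) * math.log((1 - x) / (1 - theta))

def partial_theta(x, theta, h=1e-6):
    """Central finite-difference estimate of the derivative with respect to
    the model coordinate, evaluated at theta; the data x is held fixed."""
    return (divergence(x, theta + h) - divergence(x, theta - h)) / (2 * h)

x = 0.7
slope_at_fit = partial_theta(x, 0.7)     # derivative at the best-fitting model
slope_elsewhere = partial_theta(x, 0.5)  # derivative at a poorer model
```

For this particular divergence the exact derivative is
-x/theta + (1 - x)/(1 - theta), which vanishes at theta = x and equals -0.8
at theta = 0.5.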

When specific notations are introduced on the fly, this will be denoted by
using the symbol $\overset{\text{not}}{=}$ as the equality sign, not to be
confused with the inequality $\neq$. A similar symbol
$\overset{\text{def}}{=}$ is used to indicate that the equality is in fact
also the definition of the quantity being introduced.

A final convention has to do with measures over sets. Sometimes it is
necessary to sum or integrate over such a set or a subset thereof. This
will appear in a number of expressions originating in statistics, for
example expressions for expectation values. For this the integration sign
shall be used, even when the measurable set over which the integration
takes place may be discrete. Unless explicitly stated otherwise,
$\mathrm{d}x$ represents the measure in the integral, whereas in the
literature it would be common to use a notation like $\mathrm{d}\mu(x)$
instead.




2. ELEMENTARY DIFFERENTIAL GEOMETRY


This chapter is the first of two introductory chapters. As such, it
provides a quick introduction to differential geometry, the branch of
mathematics that supplies the mathematical framework for this dissertation.
Before commencing the introduction of this subject in earnest, it should be
noted that differential geometry is an exceptionally interesting and rich
field. As such, no brief introduction could possibly do it justice.
Therefore this discussion is limited to those concepts strictly necessary
for the understanding of this dissertation. This list of topics, which will
also serve as a backbone for the following two chapters, includes
manifolds, coordinate functions, the metric tensor and the affine
connection.

Readers interested in discovering more of this most elegant field are
referred to introductory works such as [27] and [28], upon which this
introduction has loosely been based and which serve as the main reference
material. The book by Pressley is dedicated to the "extrinsic" variant of
differential geometry, which means it studies curves and surfaces embedded
in a larger space. Frankel's book, on the other hand, devotes most of its
attention to the "intrinsic" theory, which can be formulated independently
of a containing space and which is also closer to the differential geometry
applied in this dissertation. Readers with an interest in information
geometry, which will be discussed in the next chapter, may also find the
first chapters of Amari's books [29] and [30] interesting introductions,
although they are far less broad in scope than more dedicated works.

2.1 The origins of differential geometry 

Differential geometry is a field of study opened up by Carl Friedrich
Gauss, originally concerned with curves and surfaces embedded in Euclidean
space. Its scope was greatly expanded after the contributions of the
Hungarian mathematician János Bolyai and the Russian mathematician Nikolai
Ivanovich Lobachevsky, who discovered the solution to a long-standing
problem [31]. Over the course of two millennia, scores of mathematicians
were plagued by the nature of Euclid's so-called Parallel Postulate [32],
which in the Playfair formulation reads¹

In a plane, given a line A and a point p not on it, at most one
line B parallel to the given line A can be drawn through p.





It was generally believed that this axiom is a consequence of the other
axioms of Euclidean geometry, even though no one was able to demonstrate
this. Bolyai and Lobachevsky independently settled the matter by showing
the existence of mathematically consistent geometries which do not satisfy
Euclid's Parallel Postulate. To achieve this, they altered the axiom so as
to demand that more than one straight line parallel to A can be constructed
through p. The resulting geometries are known as hyperbolic geometries. Two
decades later, the German mathematician Georg Bernhard Riemann developed
what is now known as Riemannian geometry [33]. This is probably the
best-known class of non-Euclidean geometries in the physics community, as
the mathematical framework of Einstein's general theory of relativity is
based upon Riemann's ideas. Through the contributions of mathematicians
such as Élie Cartan, Henri Poincaré and others, the study of Riemannian
geometry eventually led to the mathematical branch now known as
differential geometry, a very general framework for describing the geometry
of spaces.

2.2 Manifolds and (co)tangent spaces 

The stages upon which differential geometry takes place, that is, the
spaces whose geometry is studied, are manifolds. Manifolds will be denoted
by the blackboard-bold letter $\mathbb{M}$. These are sets in which every
element $p$, usually called a point, has a neighbourhood $\mathcal{N}_p$
which allows for the definition of coordinates for the points in that
neighbourhood. A coordinate function $\varphi$ is a homeomorphism
$\varphi : \mathcal{N}_p \to \mathbb{R}^n$, that is, it associates with
every point an ordered set of $n$ real numbers.


¹ It should be noted that this formulation is only correct in the presence
of the four preceding axioms of Euclidean geometry. In the current context,
however, this formulation suffices and it has the advantage of simplicity.
These numbers $\{\varphi^i(p)\}$ are called the coordinates of the point
$p$. A homeomorphism is a bijective map which is continuous and which has a
continuous inverse [31]. It is important for $\varphi$ to be a
homeomorphism and not just any map. The invertibility means that $\varphi$
endows only a single point of its domain with a specific set of
coordinates. The continuity of the coordinate map means that points close
together in $\mathbb{M}$ are associated with coordinates which are also
close together in $\mathbb{R}^n$. This ensures that the topological
structure of $\mathbb{M}$ is faithfully reflected in the coordinates. The
natural number $n$ is called the dimension of the manifold. For practical
reasons the dimension is assumed to be finite, although many interesting
properties and applications do exist in situations where the dimension is
infinite, see for instance [35, 36]. Many mathematical sets encountered in
physics are manifolds, even though they are not always identified as such.
The space in which classical physics takes place is a manifold and the
coordinate maps provide the familiar coordinates of points. The same is
true for space-time as it appears in the theory of relativity. Most of the
terminology in the theory of manifolds is actually derived from the analogy
with physical space. Other examples of manifolds include the configuration
space of classical mechanics and the state space of statistical physics, as
well as all finite-dimensional vector spaces and, by extension, Hilbert
spaces, which are widely known as an essential element of modern quantum
theory. Care should be taken with these examples, however, as not all of
them can be covered by a single coordinate function.
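
A simple manifold that cannot be covered by a single coordinate function is
the unit circle, which is used here purely as an illustration and is not an
example taken from the text. The sketch below implements one possible
coordinate function, the stereographic projection defined on the circle
with the north pole (0, 1) removed, together with its inverse, and checks
on a few hypothetical sample points that the two maps undo each other. A
second coordinate function would be needed to cover the north pole.

```python
import math

def chart(point):
    """Stereographic coordinate on the unit circle minus the north pole (0, 1)."""
    x, y = point
    return x / (1.0 - y)

def chart_inverse(u):
    """Inverse map from the real line back onto the circle."""
    d = u * u + 1.0
    return (2.0 * u / d, (u * u - 1.0) / d)

angles = [0.0, 1.0, 2.5, -2.0]  # hypothetical sample points, none at the pole
points = [(math.cos(a), math.sin(a)) for a in angles]
round_trip = [chart_inverse(chart(p)) for p in points]
```

Both maps are continuous on their domains, which is what makes this
coordinate function a homeomorphism between the punctured circle and the
real line.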

Two very important structures which can be defined on manifolds and used
in describing the geometry of a space are the metric tensor and affine
connections. In order to elucidate these concepts, it is necessary to first
introduce a few more elementary concepts such as vectors and differential
forms.

After the manifold, the most basic object of differential geometry is the
vector. Perhaps the physically most intuitive interpretation is derived
from velocity vectors. Consider an observer who measures the value of a
function $f$ everywhere on her path and who also keeps time in order to
tabulate the measured value as a function of time. This observer can then
compute the rate of change in the value of $f$ she observed with respect to
the time passed. It is possible to define the velocity vector of this
observer as a functional which returns precisely this rate of change when
applied to the function $f$. More rigorously, when a coordinate function
$\varphi$ is fixed, a vector $v$ at a point $p$ is a linear operator on
real-valued functions $f : \mathbb{M} \to \mathbb{R}$ satisfying

    \[ f \mapsto v(f) = v^i \left. \frac{\partial (f \circ \varphi^{-1})}{\partial \varphi^i} \right|_{\varphi(p)} \tag{2.1} \]


for all functions $f$ for which these derivatives exist. The numbers $v^i$
are called the components of $v$. In the case of velocity vectors these
components equal the derivatives of the coordinate functions with respect
to time.
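
The observer picture can be checked numerically. The following sketch
assumes a hypothetical path in the plane, described in Cartesian
coordinates, and a hypothetical scalar function. It computes the velocity
components as time derivatives of the coordinates, applies the resulting
vector to the function as in (2.1), and compares the outcome with the rate
of change the observer would tabulate directly. All derivatives are
approximated by central finite differences.

```python
import math

def f(x, y):
    """Hypothetical scalar function measured by the observer."""
    return x * x * y

def path(t):
    """Hypothetical path of the observer, in Cartesian coordinates."""
    return (math.cos(t), math.sin(t))

def derivative(g, t, h=1e-6):
    """Central finite-difference derivative of a function of one variable."""
    return (g(t + h) - g(t - h)) / (2 * h)

def apply_vector(components, func, point, h=1e-6):
    """Evaluate v(f) = v^i d_i f at the given point (sum over i)."""
    total = 0.0
    for i, v_i in enumerate(components):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        total += v_i * (func(*plus) - func(*minus)) / (2 * h)
    return total

t0 = 0.3
# Components of the velocity vector: time derivatives of the coordinates.
velocity = [derivative(lambda t: path(t)[i], t0) for i in range(2)]
rate_from_vector = apply_vector(velocity, f, path(t0))
rate_observed = derivative(lambda t: f(*path(t)), t0)
```

The two rates agree up to the accuracy of the finite differences.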

Even though the definition (2.1) depends on the choice of a coordinate
function $\varphi$, it can be shown that the vector itself is indeed
invariant under suitable coordinate transformations. In particular, if a
second set of coordinates $\{\zeta^\alpha\}$ is employed, these must be a
diffeomorphic function, that is, a smooth function with a smooth inverse,
of the original coordinates $\{\varphi^i\}$. In such a case, the chain rule
of calculus implies that the derivatives relate to each other as

    \[ \frac{\partial}{\partial \zeta^\alpha} = \frac{\partial \varphi^i}{\partial \zeta^\alpha} \frac{\partial}{\partial \varphi^i}, \]

where the first factor on the right-hand side represents the components of
the Jacobian matrix of the transformation. If the coordinate independence
of the vector $v = v^i \partial_i = v^\alpha \partial_\alpha$ is to be
respected, this means the components must transform according to the
inverse transformation, that is

    \[ v^\alpha = \frac{\partial \zeta^\alpha}{\partial \varphi^i} v^i. \]

That this is satisfied is also a consequence of the chain rule, as it is
equivalent to the expression

    \[ \frac{\mathrm{d}\zeta^\alpha}{\mathrm{d}t} = \frac{\partial \zeta^\alpha}{\partial \varphi^i} \frac{\mathrm{d}\varphi^i}{\mathrm{d}t}. \]
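
As a concrete instance of this transformation rule, the sketch below takes
Cartesian coordinates in the plane as the original coordinates and polar
coordinates as the second set, both chosen only for illustration. It
approximates the matrix of partial derivatives of the new coordinates with
respect to the old ones by finite differences and uses it to transform the
components of a hypothetical velocity vector.

```python
import math

def to_polar(x, y):
    """Second coordinate system: (radius, angle) as functions of (x, y)."""
    return (math.hypot(x, y), math.atan2(y, x))

def jacobian(transform, point, h=1e-6):
    """Finite-difference matrix J[a][i] = d(new coordinate a)/d(old coordinate i)."""
    n = len(point)
    columns = []
    for i in range(n):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        f_plus = transform(*plus)
        f_minus = transform(*minus)
        columns.append([(p - m) / (2 * h) for p, m in zip(f_plus, f_minus)])
    # columns[i][a] holds the derivative of new coordinate a along old
    # coordinate i; transpose so that the result is indexed as J[a][i].
    return [[columns[i][a] for i in range(n)] for a in range(len(columns[0]))]

point = (1.0, 1.0)        # hypothetical point, away from the origin
v_cartesian = [0.0, 1.0]  # hypothetical velocity: motion in the +y direction
J = jacobian(to_polar, point)
v_polar = [sum(J[a][i] * v_cartesian[i] for i in range(2)) for a in range(2)]
```

At this point the exact polar components are 1/sqrt(2) for the radial
coordinate and 1/2 for the angular coordinate.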

As was hinted at before, the most common example of a vector is the
velocity vector, both in classical mechanics and in the theory of
relativity. Acceleration is also described as a vector. Many other
quantities are described as vectors in physics, such as linear and angular
momentum, force and torque, among others. Most of these, however, including
the four listed, appear more naturally in modern theories as differential
forms [28]. Those objects and their operations can be framed as vectors,
but doing so requires a few pieces of extra mathematical machinery, not all
of which fall within the scope of this introduction. It is important to
remark that the particular definition of vectors used here is more
restricted than the algebraic definition as simply an element of a vector
space. Vectors of this particular kind are properly called tangent vectors.
Another type of vector in the algebraic sense to be discussed in this
dissertation is the differential form, with which this section will be
concluded. Other, more general, vector structures will not play a major
role in this dissertation and will therefore not be given further
consideration in this introduction.

It is possible to verify that all tangent vectors with the same point of
application $p$ form a vector space when endowed with componentwise
addition and scalar multiplication. This follows from the fact that linear
combinations of such vectors are still of the form (2.1). This vector space
is called the tangent space at the point $p$, or $T_p\mathbb{M}$ for short.
Together, all vectors over a manifold constitute the tangent bundle²,
abbreviated $T\mathbb{M}$. The tangent bundle can also be seen as a
manifold (the configuration space of mechanics is probably the best-known
tangent bundle) but it is not a vector space, as addition is only defined
for pairs of vectors which share a point of application. The operations
obtained by applying the partial derivative operators $\{\partial_i\}$ and
evaluating the result at a point $p$ form the basis vectors of the tangent
space $T_p\mathbb{M}$.

A particularly interesting type of subset of the tangent bundle is the
vector field. A vector field can be likened to a function
$F : \mathbb{M} \to T\mathbb{M}$ but with the additional restriction that
the function value at a point $p$ must be an element of $T_p\mathbb{M}$ and
not of another tangent space. The traditional but more technical
formulation states that a map $F : \mathbb{M} \to T\mathbb{M}$ is a vector
field when $\pi \circ F$, where $\pi$ is the bundle projection map
associating with every vector its point of application, is the identity
mapping on the domain of $F$. The basis vectors employed thus far, the
differential operators $\{\partial_i\}$, naturally form
$n = \dim(\mathbb{M})$ linearly independent vector fields. Such a
collection of vector fields which form a basis for each tangent space (in
their domain) is called a frame. It is not required that a frame consist of
differential operators with respect to coordinate functions, but this
dissertation will exclusively employ such so-called coordinate frames as
they are probably the most familiar to readers.

As the tangent spaces $T_p\mathbb{M}$ are vector spaces, it is meaningful
to speak of the dual vector space: the space of all linear functionals on
$T_p\mathbb{M}$ with values in the real numbers. Where tangent vectors are
probably most familiar to readers as column vectors, their duals are
traditionally denoted, and better known, as row vectors. The dual space of
$T_p\mathbb{M}$ is called the cotangent space at $p$ and is denoted
concisely as $T_p^*\mathbb{M}$. Together these cotangent spaces make up the
cotangent bundle $T^*\mathbb{M}$. Smooth fields over this bundle are called
(differential) 1-forms. Phase space as it appears in Hamiltonian mechanics
is probably the most familiar example of a cotangent bundle, even though it
is not usually introduced as such in physics textbooks.

² Some sources such as m reserve this name for the bundle projection map.

Just as a canonical basis $\{\partial_i\}$ of partial derivative operators
exists for $T_p\mathbb{M}$ given a coordinate function $\varphi$, a
canonical basis $\{\sigma^i\}$ exists for $T_p^*\mathbb{M}$ given this same
coordinate function. The functionals making up this dual basis satisfy

    \[ \sigma^i(\partial_j) = \delta^i_j. \tag{2.3} \]

A most elegant way of introducing the canonical dual basis is through the exterior derivative operator, for which the symbol "d" is used. This operator maps its arguments, which are differentiable functions, into differential 1-forms. More specifically, the exterior derivative $df$ of a scalar function f acts on a vector v as

$$(df)(v) = v(f).$$

By choosing $f = \varphi^i$, the i-th component of the coordinate function, it follows that

$$d\varphi^i(\partial_j) = \partial_j(\varphi^i) = \delta^i_j.$$

This, combined with the linearity of these operators, implies $\sigma^i = d\varphi^i$ for the duals of the basis $\{\partial_j\}$. It is often useful to choose the 1-forms $\{d\varphi^i\}$ as the basis for the cotangent spaces, even though it is not strictly necessary to do so. One advantage of this choice is that the exterior derivative of a function f can easily be expressed as

$$df = (\partial_i f)\, d\varphi^i. \tag{2.4}$$

This expression also shows there is a connection between the exterior derivative of a function and its gradient. The numbers $\partial_i f$ are equal to the components of the gradient of f in Cartesian coordinates but not in an arbitrary curvilinear coordinate system. Expression (2.4), however, holds in all coordinate systems, which offers obvious practical advantages.

The 1-forms introduced allow for the construction of a very rich algebraic structure. They can be endowed with a product, denoted as $\wedge$, which is fully antisymmetric. More specifically, given two 1-forms $\alpha$ and $\beta$, the 2-form $\alpha \wedge \beta$ is a bilinear mapping defined by

$$(\alpha \wedge \beta)(v, w) = \alpha(v)\beta(w) - \alpha(w)\beta(v).$$

This operation has many uses in differential geometry, including the definition of the exterior derivative of 1-forms. Given a 1-form $\alpha = \alpha_i\, d\varphi^i$ where $\alpha_i = \alpha(\partial_i)$, its exterior derivative equals

$$d\alpha := (d\alpha_i) \wedge d\varphi^i = (\partial_j \alpha_i)\, d\varphi^j \wedge d\varphi^i.$$

From the skew symmetry of $d\varphi^i \wedge d\varphi^j$, it automatically follows that $d(df) = 0$ for all real-valued functions f. A very useful property is the (partial) converse to this, which is called Poincaré's lemma. It states that, on a manifold which can be contracted into a single point by a continuous transformation, the vanishing of a form's exterior derivative, i.e. $d\alpha = 0$, is a sufficient condition for the existence of a function f such that $\alpha = df$. Though not all manifolds in this thesis are indeed contractible, it is nevertheless often possible to apply Poincaré's lemma locally even when it does not hold globally. Such a function f is called a potential for the 1-form $\alpha$.
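The antisymmetry of the wedge product can be made concrete with a small numerical sketch. The Python fragment below is only an illustration and not part of the formalism: the function name `wedge` and the component values are hypothetical choices, and 1-forms and vectors are simply represented by their components in a fixed coordinate frame.

```python
def wedge(alpha, beta):
    """Return the 2-form alpha ^ beta built from two 1-forms given by components."""
    def apply(form, vec):
        # A 1-form acts on a vector by contracting components: alpha_i v^i.
        return sum(a * v for a, v in zip(form, vec))

    def two_form(v, w):
        # (alpha ^ beta)(v, w) = alpha(v) beta(w) - alpha(w) beta(v)
        return apply(alpha, v) * apply(beta, w) - apply(alpha, w) * apply(beta, v)

    return two_form


# Hypothetical components, chosen only for illustration.
alpha = [1.0, 2.0, 0.0]
beta = [0.0, 1.0, 3.0]
v = [1.0, 0.0, 2.0]
w = [0.0, 1.0, 1.0]

omega = wedge(alpha, beta)
print(omega(v, w), omega(w, v))   # opposite signs by antisymmetry
print(wedge(alpha, alpha)(v, w))  # a 1-form wedged with itself vanishes
```

Swapping the two vector arguments flips the sign of the result, and the wedge product of a 1-form with itself vanishes identically, which is the property used above to conclude that $d(df) = 0$.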

This structure of differential forms can be extended to define higher forms and their exterior derivatives as well. This is done in a completely analogous fashion. The use of forms in this dissertation will remain fairly limited to accommodate readers more familiar with the index-heavy tensorial notation traditionally used in the physics literature. As a consequence, forms of arbitrary degree will not be treated in this introduction. Nevertheless, some results involving forms of higher than first degree will be required in some situations. One example of such a result is the analogue of the Leibniz rule for the wedge product of two 1-forms $\alpha$ and $\beta$:

$$d(\alpha \wedge \beta) = (d\alpha) \wedge \beta - \alpha \wedge (d\beta).$$

Poincaré's lemma will also be used for higher forms. Its generalisation is straightforward: on a contractible manifold any differential p-form of which the exterior derivative vanishes can be written as the exterior derivative of some differential (p − 1)-form.

2.3 The Riemannian metric 

So far this introduction to differential geometry has not yet included any mention of distance or angles, arguably both important concepts in any study of geometry. In order to introduce these quantities, one must choose an inner product for each of the tangent spaces. Such a mapping is called a metric tensor and in particular, since it is defined on the tangent spaces, a Riemannian metric. This mapping is denoted by g and so it can be written as^

$$g|_p : T_pM \times T_pM \to \mathbb{R}.$$

An inner product must be symmetric, which means that $g(v, w) = g(w, v)$ for all vectors v and w in the same tangent space. Another requirement is positive definiteness, which means $g(v, v) \geq 0$ for all vectors v and that the value can only be equal to zero when all the vector components equal zero. Often the value of the inner product will be written in its coordinate representation $g(v, w) = g_{ij} v^i w^j$, where the components are given by

$$g_{ij} = g(\partial_i, \partial_j). \tag{2.5}$$

The fact that this notation is possible follows from the property of bilinearity, which means linearity holds in both arguments separately.

Lengths of vectors in $T_pM$ and angles between two vectors in this space can be computed using the metric through the respective familiar relations

$$\|v\| = \sqrt{g(v, v)}\,\Big|_p \quad\text{and}\quad \cos(\theta) = \frac{g(v, w)}{\|v\|\,\|w\|}\,\bigg|_p.$$

Readers will no doubt be familiar with the dot product on $\mathbb{R}^n$ but this does not necessarily paint a representative image of Riemannian metrics. This is because a large part of the physics literature describes space also as $\mathbb{R}^n$, which itself can also be seen as a vector space. Elementary textbooks traditionally make use of this extra structure when introducing vector calculus. In particular, identifying all tangent spaces with each other conceals that the dot product is actually defined on each tangent space separately. The best known examples of non-trivial Riemannian metrics are probably the solutions to the field equations which lie at the heart of Einstein's general theory of relativity. The Fisher information matrix [38] appearing in statistics, for example in the Cramér-Rao inequality [39], is used as a metric tensor in the field of information geometry. This will be discussed in more detail in the next chapter.
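To illustrate that the components $g_{ij}$ need not be those of the familiar dot product, the following sketch evaluates lengths and angles for the Euclidean plane written in polar coordinates $(r, \theta)$, where the metric components are $g_{rr} = 1$ and $g_{\theta\theta} = r^2$. The helper names `inner`, `length` and `angle` are hypothetical and chosen only for this example.

```python
import math


def inner(g, v, w):
    """Coordinate representation g(v, w) = g_ij v^i w^j."""
    n = len(v)
    return sum(g[i][j] * v[i] * w[j] for i in range(n) for j in range(n))


def length(g, v):
    return math.sqrt(inner(g, v, v))


def angle(g, v, w):
    return math.acos(inner(g, v, w) / (length(g, v) * length(g, w)))


r = 2.0
g_polar = [[1.0, 0.0], [0.0, r * r]]  # Euclidean metric in polar coordinates at radius r

d_r = [1.0, 0.0]      # coordinate basis vector along r
d_theta = [0.0, 1.0]  # coordinate basis vector along theta
v = [1.0, 0.5]        # components (v^r, v^theta) of an arbitrary tangent vector

print(length(g_polar, d_theta))  # equals r, not 1
print(angle(g_polar, d_r, v))    # angle between d_r and v
```

The coordinate basis vector along $\theta$ has length r rather than 1, which shows that the components of a vector alone do not determine its length: the metric at the point of application is needed as well.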


2.4 Connections and curvature 

A similarly important structure on a manifold is the affine connection. The connection may seem a more technical notion than the metric but it has many practical applications nonetheless. The most common way to introduce a connection on a tangent bundle is through a covariant derivative operation, denoted by $\nabla$. Although strictly speaking an abuse of notation, this symbol is often used to refer to the connection as well. Just as derivatives of functions play a role in many branches of mathematics and science, a similar notion is often very useful for vector fields. Consider the problem of defining the directional derivative of a vector field X in the direction of the vector $v \in T_pM$. It is possible to construct such a derivative and the resulting vector is denoted by $\nabla_v X$. However, some difficulty is involved in working out this notion so it is instructive to first take a look at the easy aspects and leave the hard part for last.

^ Where context avoids confusion, the index p from this notation will not be mentioned explicitly. Alternatively, this definition can be expanded to map pairs of vector fields into real-valued functions over M.

A first property that the covariant derivative must satisfy is linearity in its first argument, just like the more familiar directional derivative of scalar functions. This means

$$\nabla_v X = v^i \nabla_{\partial_i} X =: v^i \nabla_i X.$$

Furthermore, the covariant derivative should be linear for addition in the second argument and satisfy the Leibniz rule

$$\nabla_v(fX) = v(f)X + f\nabla_v X, \tag{2.6}$$

where f is an arbitrary differentiable function. This property, combined with the expansion of the vector field $X = X^i \partial_i$, states that

$$\nabla_v X = v(X^i)\,\partial_i + X^i \nabla_v \partial_i. \tag{2.7}$$

This shows it suffices to determine the covariant derivatives of the basis vectors, here of course seen as fields. Because of the first linearity property, it is even enough to know at every point p the coefficients of the expansion

$$\nabla_i \partial_j = \omega^k_{ij}\,\partial_k. \tag{2.8}$$

These numbers are called the coefficients of the connection at p. There is no obvious way to determine these numbers in all contexts, something which is also the case with the metric tensor. In fact, the whole point of Einstein's general theory of relativity is that the geometric properties of spacetime follow from physical laws rather than from mathematical principles! The connections playing an important role in information geometry are also defined from context-specific arguments. This is elucidated in the next chapter.

It is interesting to note that the second term in the right hand side of equation (2.7) is essentially a linear transformation of the vector X (at p). This is even more apparent when leaving the first argument of the covariant derivative unmentioned:

$$\nabla X = \partial_i \otimes \left[dX^i + \omega^i_j X^j\right],$$

where the expression between square brackets is an element of $T^*M$ and the connection 1-forms

$$\omega^i_j = \omega^i_{kj}\, d\varphi^k \tag{2.9}$$

were implicitly introduced. This notation of connection forms turns out to be very convenient in some formulas. Against the general trend in this dissertation of limiting the use of forms, this notation will appear in some places. For example the proof of Riemann's theorem following shortly would be much more cumbersome when making use of the more usual notations introduced in equations (2.7) and (2.8).

Apart from yielding a definition for the derivative of a vector field, a connection can also be used to define whether or not two vectors at different points are parallel. In order to do this, the concept of parallel transport must first be defined. Given a curve $C : [t_i, t_f] \to M$, it is said a vector field Y, defined (at least) on the curve C with tangent vector field $X = \frac{d}{dt}$, is parallel along C or has undergone parallel transport along C if

$$0 = \nabla_X Y = X^k\left[\partial_k Y^i + \omega^i_{kj} Y^j\right]\partial_i = \left[\frac{dY^i}{dt} + \omega^i_{kj} X^k Y^j\right]\partial_i.$$

This gives a way to transport vectors from one tangent space to another: solving these n ordinary first order differential equations yields the components of the vector field everywhere along the curve (parametrised by t). In the drawing this is demonstrated for a toy connection. The tangent vector field X to the curve (in grey) is also shown.



In general it should be noted that transporting a vector over different curves with identical end points need not yield the same result. Equivalently, parallel transport of a vector around a closed loop will in general not make the final vector coincide with the first one. The difference between the original vector and a vector transported around a closed loop can be quantified using the curvature of the connection. If the curvature vanishes, the connection is said to be flat and this is equivalent to saying that parallel transport is independent of the chosen curve.
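The failure of a transported vector to return to itself can be observed numerically. The sketch below is an illustration under stated assumptions, not a construction taken from this text: it uses the unit sphere with polar angle $\theta$ and azimuth $\phi$, and assumes the standard connection coefficients of its metric connection, $\omega^\theta_{\phi\phi} = -\sin\theta\cos\theta$ and $\omega^\phi_{\theta\phi} = \omega^\phi_{\phi\theta} = \cot\theta$. The function name and the Runge-Kutta step count are hypothetical choices. The parallel transport equations are integrated once around the circle of constant $\theta = \theta_0$.

```python
import math


def transport_around_latitude(theta0, y0, steps=2000):
    """Parallel transport y0 = (Y^theta, Y^phi) once around the circle theta = theta0.

    The curve is phi = t with t in [0, 2 pi], so its tangent is X = d/dphi and the
    transport equations read dY^i/dt = -omega^i_{kj} X^k Y^j.
    """
    s, c = math.sin(theta0), math.cos(theta0)

    def rhs(y):
        # -omega^theta_{phi phi} Y^phi  and  -omega^phi_{phi theta} Y^theta
        return (s * c * y[1], -(c / s) * y[0])

    h = 2.0 * math.pi / steps
    y = tuple(y0)
    for _ in range(steps):  # classical fourth-order Runge-Kutta steps
        k1 = rhs(y)
        k2 = rhs((y[0] + 0.5 * h * k1[0], y[1] + 0.5 * h * k1[1]))
        k3 = rhs((y[0] + 0.5 * h * k2[0], y[1] + 0.5 * h * k2[1]))
        k4 = rhs((y[0] + h * k3[0], y[1] + h * k3[1]))
        y = (y[0] + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
             y[1] + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)
    return y


y_final = transport_around_latitude(math.pi / 3.0, (1.0, 0.0))
y_equator = transport_around_latitude(math.pi / 2.0, (1.0, 0.0))
print(y_final)    # differs from the initial vector (1, 0)
print(y_equator)  # along the equator the vector returns to itself
```

Solving these equations by hand shows that the vector rotates by an angle $2\pi\cos\theta_0$ during one circuit. For $\theta_0 = \pi/3$ this is a rotation over $\pi$, so the transported vector comes back reversed, whereas along the equator, which is a geodesic, it returns unchanged.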

Most introductory texts introduce curvature through the commutativity of covariant derivatives, which is equivalent to the vanishing of the curvature. While this is correct, such a derivation usually does not include the reasoning leading up to the most interesting result, which is the existence of covariant constant vector fields. A vector field X is said to be covariant constant if parallel transport of the value of the vector field at a point p to a point q is equal to the value of the field at q. This definition requires the parallel transport to be independent of the path and so the vanishing of the curvature is a necessary condition for the existence of covariant constant vector fields. The following paragraphs deal with showing that this is also a sufficient condition by proving the slightly stronger result known as Riemann's theorem.

The best-known property of manifolds endowed with a flat connection is that they allow for global Cartesian coordinates. However, the connection may not be arbitrary. Rather it must be the metric connection of Levi-Civita, which has coefficients given by

$$\omega^k_{ij} = \Gamma^k_{ij} := \tfrac{1}{2}\, g^{ks}\left(\partial_i g_{sj} + \partial_j g_{is} - \partial_s g_{ij}\right), \tag{2.10}$$

where the numbers $g^{ks}$ are the components of the matrix inverse of the metric tensor. It will be shown that when this particular connection exhibits vanishing curvature, there exists on the manifold a local set of coordinates in which the metric tensor everywhere takes the Euclidean form

$$g(v, w) = \sum_{i=1}^{n} v^i w^i. \tag{2.11}$$

The property that the metric tensor can be written in this form when the metric connection (2.10) is flat is Riemann's theorem. Note that this is not a trivial statement. The metric can take this form at any point by choosing an appropriate set of coordinates called normal coordinates. However, nothing guarantees that this form will be valid also outside this one point while still using the same set of coordinates. It is only when the curvature vanishes that coordinates exist in which this property holds everywhere.
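The coefficients of the metric connection can be evaluated directly from the standard expression $\Gamma^k_{ij} = \frac{1}{2} g^{ks}(\partial_i g_{sj} + \partial_j g_{is} - \partial_s g_{ij})$. The sketch below does so with central finite differences for a two-dimensional manifold. It is a minimal illustration rather than a general implementation: the function names are hypothetical, the restriction to two dimensions only serves to keep the matrix inverse explicit, and the test metric is that of the Euclidean plane in polar coordinates.

```python
def inverse_2x2(g):
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]


def metric_derivative(metric, x, s, h):
    """Central finite difference of all components g_ij with respect to coordinate s."""
    xp, xm = list(x), list(x)
    xp[s] += h
    xm[s] -= h
    gp, gm = metric(xp), metric(xm)
    return [[(gp[i][j] - gm[i][j]) / (2.0 * h) for j in range(2)] for i in range(2)]


def christoffel(metric, x, h=1e-5):
    """Return omega[k][i][j] = 1/2 g^{ks} (d_i g_sj + d_j g_is - d_s g_ij)."""
    g_inv = inverse_2x2(metric(x))
    dg = [metric_derivative(metric, x, s, h) for s in range(2)]  # dg[s][i][j] = d_s g_ij
    return [[[0.5 * sum(g_inv[k][s] * (dg[i][s][j] + dg[j][i][s] - dg[s][i][j])
                        for s in range(2))
              for j in range(2)]
             for i in range(2)]
            for k in range(2)]


def polar_metric(x):
    r, _ = x
    return [[1.0, 0.0], [0.0, r * r]]  # Euclidean plane in coordinates (r, theta)


omega = christoffel(polar_metric, [2.0, 0.3])
print(omega[0][1][1])  # omega^r_{theta theta}, analytically -r
print(omega[1][0][1])  # omega^theta_{r theta}, analytically 1/r
```

Although the coefficients do not vanish in polar coordinates, the underlying connection is flat: in Cartesian coordinates the same metric has constant components and all coefficients are zero, in agreement with Riemann's theorem.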

First, it will be shown that for any connection the parallel displacement of vector fields is independent of the path followed if and only if its curvature vanishes. Only then will it be shown how this gives rise to the Euclidean form of the metric tensor.

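A vector field X is said to be parallel along a curve C parametrised by t if its covariant derivative vanishes along that curve or, equivalently,

$$0 = \nabla_{\frac{d}{dt}} X = \partial_i\left[\frac{dX^i}{dt} + \omega^i_{kj}\,\frac{d\varphi^k}{dt}\,X^j\right] = \partial_i \otimes \left[dX^i + \omega^i_j X^j\right]\!\left(\frac{d}{dt}\right). \tag{2.12}$$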


From the condition (2.12) for the vanishing of the covariant derivative it can be seen that parallel transport yields the same result independent of the curve C if and only if

$$dX^i + \omega^i_j X^j = 0 \tag{2.13}$$

since this left hand side is the curve-independent part of the condition. Frobenius' theorem states that a sufficient condition for the equations (2.13) to have a solution is then for the 2-forms

$$d\left(dX^i + \omega^i_j X^j\right)$$

to vanish [28]. It is possible to rewrite, using (2.13) again in the second equality,

$$\begin{aligned}
d\left(dX^i + \omega^i_j X^j\right) &= d^2 X^i + (d\omega^i_j) X^j + dX^j \wedge \omega^i_j \\
&= (d\omega^i_j) X^j + \left(-\omega^j_k X^k\right) \wedge \omega^i_j \\
&= \left(d\omega^i_j + \omega^i_k \wedge \omega^k_j\right) X^j.
\end{aligned}$$

This means parallel transport of any vector field X is path-independent when the curvature 2-forms

$$\Omega^i_j := d\omega^i_j + \omega^i_k \wedge \omega^k_j \tag{2.14}$$

vanish identically, which concludes the proof of the first statement made above. In a more index-heavy notation, this can be expressed as

$$0 = \Omega^l_{kij} = \partial_i \omega^l_{jk} - \partial_j \omega^l_{ik} + \omega^l_{is}\,\omega^s_{jk} - \omega^l_{js}\,\omega^s_{ik}.$$

When the curvature vanishes, parallel transport is path independent and so covariant constant vector fields can be constructed by parallel transporting a vector at a single point to other points of M.

The proof of Riemann's theorem is continued by showing that these covariant constant vector fields are partial derivatives with respect to some coordinate functions $\theta^i$. This is useful since in these coordinates, the connection coefficients

$$\omega^k_{ij} = d\theta^k\left(\nabla_{\partial_i} \partial_j\right)$$

vanish identically. Such coordinates are known as the affine coordinates of the connection. Although the proof is not in itself very difficult, it requires some new concepts and properties, the introduction of which would burden the reader with more text than can be justified from their use in this dissertation. Interested readers are thus referred to the source material mentioned earlier for the details of this part of the proof.

It should be stressed that the results up to now hold for all flat affine connections. Riemann's theorem deals with the specific case of the Levi-Civita connection (2.10). The interesting property of this connection is that it satisfies

$$X(g(Y, Z)) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z) \tag{2.15}$$

and so it preserves inner products between vector fields under parallel transport. But this means that when one chooses an orthonormal basis at one point of M, parallel transport through the Levi-Civita connection will result in vector fields which are orthonormal everywhere. The drawing illustrates this for transport along a curve in the plane.



By the above, the vector fields obtained in this way are partial derivatives with respect to some coordinates $\{\theta^i\}$. In those coordinates, the metric will thus take its Euclidean form (2.11). This concludes the proof of Riemann's theorem.

Perhaps the best known application of connections consists in the construction of geodesics. Most people are familiar with geodesics as the shortest paths connecting two given points, but this is not true in general: it is a property exclusive to the Levi-Civita connection. In general a geodesic is a curve whose tangent vectors form a covariant constant vector field. Since such vector fields are said to be parallel to themselves, the geodesics are the natural generalisation of straight lines.

The curvature as measured by the 2-forms $\Omega^i_j$ is the intrinsic curvature and, if applicable, is independent of the way the manifold is embedded in a larger manifold. As such it must be contrasted with extrinsic or embedding curvature, which depends on the way the manifold is embedded in a larger space and in particular on how the normal vector fields on the manifold behave as one moves over the manifold. An entire theory of embedding curvature can be set up and a short but clear summary can be found in Chapter 2 of Amari's book [29]. Pressley's book [27] also concerns itself with embedding curvature but it is limited to three dimensions and it uses the traditional notation, making it perhaps more difficult to see how to set up such a theory in an arbitrary number of dimensions. The difference between intrinsic and extrinsic curvature will become an important point later in this introduction, in particular in Chapter 3.

To draw an intuitive picture of the difference between the two kinds of curvature, some examples might be useful to keep in mind. A curve, which is a one-dimensional manifold, cannot have intrinsic curvature. This is because all 2-forms, including the curvature 2-forms $\Omega^i_j$, vanish on one-dimensional manifolds by the antisymmetry of the wedge product. A curve can have extrinsic curvature, however, and this is the case when a tangent vector field must change direction as one moves over the curve. This extrinsic curvature is measured by the inverse of the radius of curvature as it is known from more elementary texts. A sphere, on the other hand, is intrinsically curved. This means that when the sphere is embedded in a flat space like $\mathbb{R}^n$, the normal vectors to the sphere must necessarily change direction as one moves over the sphere and so the embedding curvature will not vanish globally, even though it may do so locally should the sphere be deformed appropriately.

In this dissertation several affine connections will appear and these will not all be metric connections. However, it is still possible to obtain interesting properties of affine connections beyond Riemann's famous result. In particular, it is possible to define the notion of dual connections through a relation generalising (2.15). When two connections $\nabla$ and $\nabla^*$ satisfy

$$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z), \tag{2.16}$$

again for all vector fields X, Y and Z, it is said these connections are dual with respect to the metric tensor g. Every connection $\nabla$ has a dual connection as expression (2.16) can be used as a definition of $\nabla^*$. Parallel transport of one vector field through a connection $\nabla$ and of the other vector field through the dual connection $\nabla^*$ will always preserve the inner product between the vector fields. The drawing illustrates this for a pair of orthogonal vector fields in the plane transported over a curve, each through a different connection. Note that neither of these connections preserves the length of the vectors under parallel transport. Since the original vectors are orthogonal, their transported counterparts are as well. In general the angle between the vectors need not be preserved either.



Due to the symmetry of the metric, it holds that $(\nabla^*)^* = \nabla$ and it is possible to show that a pair of dual connections have curvatures which cannot vanish independently of each other [29]. Such pairs of dual connections play an important role in information geometry but they are not so common in most applications of differential geometry. Therefore most introductory texts on differential geometry do not cover this topic.

Before finally concluding this introduction to differential geometry, it is important to devote attention to a last application of covariant derivatives. The preceding pages considered the covariant derivative of a vector. At least as important is the notion of the covariant derivative of a 1-form and the resulting definition of the Hessian. Remember that the Leibniz rule for a covariant derivative, equation (2.6), implicitly made use of the property that the covariant derivative of a function is the same as the regular derivative, that is

$$\nabla_v f = v(f).$$

One way to construct a differentiable scalar function is to apply a 1-form to a vector field, both of which must have differentiable components. This leads to the scalar function $f = \alpha(X) = \alpha_i X^i$. This form of the expression hints that a differentiation of f can be performed using a variation on the Leibniz rule:

$$\nabla_v(f) = \nabla_v(\alpha(X)) = (\nabla_v \alpha)_i X^i + \alpha_i (\nabla_v X)^i.$$

The presence of a lower index on $(\nabla_v \alpha)_i$ shows that this quantity itself is also a 1-form, just as the covariant derivative of a vector field is again a vector field. Rearranging the above line and working out the resulting expression yields

$$\begin{aligned}
(\nabla_v \alpha)(X) &= \nabla_v(\alpha(X)) - \alpha(\nabla_v X) \\
&= v^i \partial_i(\alpha_j X^j) - \alpha_k\left(v^i \partial_i X^k + \omega^k_{ij}\, v^i X^j\right) \\
&= v^i\left[\partial_i \alpha_j - \omega^k_{ij}\,\alpha_k\right] X^j.
\end{aligned}$$

Since this must hold for any vector field X, the covariant derivative of a 1-form is given by

$$\nabla_v \alpha = v^i\left[\partial_i \alpha_j - \omega^k_{ij}\,\alpha_k\right] d\varphi^j.$$

The covariant derivative has the same function for 1-forms as for vectors: it performs a directional derivative in such a way that the result is independent of the chosen basis for the (co)tangent spaces. This is particularly useful when speaking about the Hessian in the context of manifolds. The Hessian is the generalisation of the second derivative of a scalar function. However, computing higher derivatives must take into account that not only the function is liable to change from point to point. The direction in which a partial derivative quantifies the change of a function need not remain the same either. This change of direction is exactly what the connection coefficients express. Therefore, the Hessian of a function f is defined as

$$\mathrm{Hess}(f)(u, v) := (\nabla_u\, df)(v).$$

Note that though Hess(f) is a function taking two vectors as arguments and mapping them into a real number, it is not a 2-form. If f is twice continuously differentiable and the connection coefficients are symmetric in their lower indices, then Hess(f) is a symmetric mapping, whereas a 2-form would be antisymmetric. To stress the symmetry of the Hessian, a more commonly used notation for this object is

$$\mathrm{Hess}(f)(\partial_i, \partial_j) = \nabla_i \nabla_j f = \partial_i \partial_j f - \omega^k_{ij}\,\partial_k f. \tag{2.17}$$

It is this notation that will be used throughout the other chapters.
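The role of the connection term in the Hessian can be checked on a simple case. The sketch below, which is only an illustration with hypothetical function names, evaluates $\partial_i \partial_j f - \omega^k_{ij} \partial_k f$ by finite differences in polar coordinates $(r, \theta)$ of the Euclidean plane, assuming the standard coefficients $\omega^r_{\theta\theta} = -r$ and $\omega^\theta_{r\theta} = \omega^\theta_{\theta r} = 1/r$ of its flat metric connection. The function chosen is the Cartesian coordinate $x = r\cos\theta$, which is linear in Cartesian coordinates and should therefore have a vanishing Hessian.

```python
import math


def partial(f, x, i, h=1e-4):
    """Central finite difference of f with respect to coordinate i."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2.0 * h)


def second_partial(f, x, i, j, h=1e-4):
    return partial(lambda y: partial(f, y, j, h), x, i, h)


def hessian(f, connection, x):
    """Hess(f)_ij = d_i d_j f - omega^k_ij d_k f, with omega indexed as w[k][i][j]."""
    n = len(x)
    w = connection(x)
    df = [partial(f, x, k) for k in range(n)]
    return [[second_partial(f, x, i, j) - sum(w[k][i][j] * df[k] for k in range(n))
             for j in range(n)]
            for i in range(n)]


def polar_connection(x):
    r, _ = x
    w = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
    w[0][1][1] = -r                    # omega^r_{theta theta}
    w[1][0][1] = w[1][1][0] = 1.0 / r  # omega^theta_{r theta} = omega^theta_{theta r}
    return w


def cartesian_x(x):
    return x[0] * math.cos(x[1])  # the Cartesian coordinate x = r cos(theta)


point = [2.0, 0.7]
H = hessian(cartesian_x, polar_connection, point)
naive = second_partial(cartesian_x, point, 1, 1)  # plain second derivative in theta
print(H)      # all components vanish up to discretisation error
print(naive)  # does not vanish without the connection term
```

The plain second derivative with respect to $\theta$ equals $-r\cos\theta$ and so does not vanish, while the full expression with the connection term does. This is the sense in which the connection coefficients compensate for the changing direction of the coordinate basis vectors.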

As is stated in the first paragraphs of this chapter, a complete introduction to 
differential geometry would include a large amount of material not present in 
this introduction. In particular many theorems and properties are mentioned 
but not treated explicitly. Some of those theorems, such as the result on 
integrability that bears Frobenius’ name and which is used as a step in the 
above proof of Riemann’s theorem, are also important in other parts of this 
dissertation. Where such theorems are invoked they shall be mentioned when 
possible. For their proofs and the required background, the reader is again 
referred to introductory material. 




3. INFORMATION GEOMETRY 


The largest source of inspiration for the work in this dissertation is the mathematical field of information geometry. In this part of the introduction, a brief overview of this discipline will be presented. The text of this chapter is based upon Amari's books [29, 30], supplemented with elements of landmark articles in this field of research.

3.1 Parametrised statistical models 

Information geometry is the study of statistics through the methods of differential geometry. This can happen either through the use of known results from differential geometry to expand the existing knowledge of statistics or by employing geometric methods to facilitate computations in statistical problems. Some examples of such applications are found in [351110]. Of particular interest for this dissertation is the problem of parameter estimation. This problem occurs whenever one has quantitative data, which can be obtained from an experiment performed on a statistical sample, and one assumes that the true underlying distribution is an element of a parametrised set or a family of distributions. This assumption usually follows from a priori available information on the mechanism generating the data [SB].

The most familiar example of this problem in physics is probably encountered when one has measured the energies of particles in a gas and wishes to determine from this the temperature of the gas, assuming that the probability of a particle to have an energy E when the gas has temperature T is given by the Boltzmann-Gibbs distribution

$$p_T(E) = \frac{1}{Z}\exp\left\{-\frac{E}{k_B T}\right\}; \qquad Z = \int \exp\left\{-\frac{E}{k_B T}\right\}\rho(E)\,dE,$$

where $\rho$ represents the density of states. Another familiar example is that of the normal distribution

$$p_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\},$$

where $\mu$ represents the mean of the distribution and $\sigma$ is the standard deviation, the square root of the second central moment, which is used in many applications. In the rest of this introduction, the two-parameter Gaussian densities shall be used to illustrate the concepts discussed. The interested reader can find a more detailed discussion of multivariate Gaussian distributions and their information geometrical behaviour in [41].

In the above examples, the parameters T, $\mu$ and $\sigma$ map homeomorphically onto the distributions they label. Hence, they can also be considered as coordinates for the family of distributions, which in turn can be viewed as a topological manifold. A manifold of statistical distributions (or equivalently: statistical measures) is called a statistical manifold, a term first introduced by Lauritzen [42].


3.2 Tangent spaces 


It is common in information geometry to look at a particular, less abstract representation for the tangent spaces than the one used in texts on differential geometry. However, this less abstract representation directly inherits all structure from the general setting. The intuitive choice is to take as the basis vectors of the tangent space $T_\theta M$ not the partial derivative operators but rather the derivatives of a particular function of the probability density functions. The simplest choice is

$$\partial_i\, p_\theta(x) = \frac{\partial}{\partial\theta^i}\, p_\theta(x). \tag{3.1}$$


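In the case of the Gaussian density functions, these derivatives become

$$\partial_\mu\, p_{\mu,\sigma}(x) = \frac{x - \mu}{\sigma^2}\, p_{\mu,\sigma}(x) \quad\text{and}\quad \partial_\sigma\, p_{\mu,\sigma}(x) = \frac{(x - \mu)^2 - \sigma^2}{\sigma^3}\, p_{\mu,\sigma}(x).$$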
Objects such as these are stochastic variables since they depend on x, which is an element of a measurable set. However, for reasons of convenience the practitioners of information geometry often choose a different representation of the tangent vectors. They choose as basis vectors the derivatives of a power-like function of the density function, called the $\alpha$-representation of the tangent space. These objects are given by the expressions, for some $\alpha \in \mathbb{R}$,

$$\partial_i\, \ell_\alpha(p_\theta(x)) \quad\text{with}\quad \ell_\alpha(u) = \begin{cases} \dfrac{2}{1 - \alpha}\, u^{\frac{1 - \alpha}{2}}, & \alpha \neq 1, \\[1ex] \ln u, & \alpha = 1. \end{cases}$$

The previously mentioned basis vectors (3.1) are obtained as a special case by choosing $\alpha = -1$. The different representations of basis vectors may seem unconventional at first sight. However, they are the conventional basis vectors to tangent spaces of other manifolds, those consisting of the stochastic variables $x \mapsto \ell_\alpha(p_\theta(x))$. These manifolds are very similar to the statistical manifold and so they can be studied in lieu of it. Because of the similarity, it is also possible to study one manifold while using the tangent spaces of another. It is interesting to note that the functions $\ell_\alpha$ correspond to the q-logarithmic functions as introduced by Tsallis [13] through the relation $\alpha = 1 - 2q$ [301111].

Where the most natural representation of the tangent spaces corresponds to the case $\alpha = -1$, another very convenient representation is found when $\alpha = 1$ is chosen. Not only does this representation have a clear relation to the log likelihood $\theta \mapsto \ln p_\theta(x)$, the tangent vectors in the 1-representation also all have vanishing expectation value, as

$$E_\theta[\partial_i \ln p_\theta] = \int p_\theta(x)\,\frac{\partial}{\partial\theta^i}\ln p_\theta(x)\,dx = \frac{\partial}{\partial\theta^i}\int p_\theta(x)\,dx = 0.$$

The tangent spaces in this representation are thus subsets of the set of stochastic variables whose expectation value vanishes in a sample described by $p_\theta$. Exactly which subset this is depends on $\theta$ as well as on the particular statistical manifold under consideration.
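The vanishing expectation value of the tangent vectors in the 1-representation can be verified numerically for the normal distribution. The sketch below is an illustration only: the function names are hypothetical, the integral is approximated by a midpoint rule on a finite window of twelve standard deviations around the mean, and the derivatives of $\ln p_{\mu,\sigma}$ with respect to $\mu$ and $\sigma$ are entered in closed form.

```python
import math


def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)


def score(x, mu, sigma):
    """Basis vectors in the 1-representation: (d_mu ln p, d_sigma ln p)."""
    return ((x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3)


def expected_score(mu, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule estimate of E[d_i ln p] over mu +/- half_width * sigma."""
    a = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = [0.0, 0.0]
    for k in range(steps):
        x = a + (k + 0.5) * h
        p = normal_density(x, mu, sigma)
        s = score(x, mu, sigma)
        total[0] += s[0] * p * h
        total[1] += s[1] * p * h
    return total


e_mu, e_sigma = expected_score(mu=1.0, sigma=2.0)
print(e_mu, e_sigma)  # both vanish up to numerical error
```

Both expectation values vanish up to numerical error for any choice of $\mu$ and $\sigma > 0$, even though the two stochastic variables themselves depend on these parameters.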

This last property illustrates an important perspective of information geometry: a parametrised family of distributions over some measurable set X is often explicitly thought of as a manifold embedded in the larger, convex set of all distributions over X. From that perspective it can be seen that the tangent planes will depend on the choice of submanifold.^ I will not offer a detailed discussion of the other representations in this introduction as their treatment is more elegant for sets of measures which need not be normalised (see for example Chapter 2.6 of [30] for an introduction to this topic).

For ease of reference, the $\alpha = 1$ representation is called the “exponential representation” due to its relation with the logarithm and the $\alpha = -1$ representation is referred to as the “mixture representation”.


^ It is important to keep in mind this picture is only viable when the set X has a 
finite number of elements. For measurable sets X with an infinite number of elements, 
the simplex of all probability distributions over X cannot be considered a manifold in the 
strict sense of the definition as it is given in the previous chapter. 







3.3 The Riemannian metric 

In order to make the topological manifold M into a Riemannian manifold, a choice must be made for the inner product of the tangent spaces. Information geometry almost exclusively makes use of the Fisher information metric [M] for this purpose. This means the inner product of two tangent vectors V and W is defined to be equal to the covariance (in the statistical sense) of the stochastic variables V and W which correspond to the vectors, that is

$$g(V, W)|_\theta := E_\theta[VW] - E_\theta[V]\,E_\theta[W]. \tag{3.2}$$

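The most commonly used expressions for the components of this tensor are found in the exponential representation, where they are given by

$$\begin{aligned}
g_{ij}(\theta) &= E_\theta\left[(\partial_i \ln p_\theta)(\partial_j \ln p_\theta)\right] \\
&= \int_X p_\theta(x)\left(\frac{\partial}{\partial\theta^i}\ln p_\theta(x)\right)\left(\frac{\partial}{\partial\theta^j}\ln p_\theta(x)\right) dx.
\end{aligned} \tag{3.3}$$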

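As an example, for a manifold of Gaussian distributions, the Fisher information matrix has components

$$g_{\mu\mu}(\mu, \sigma) = \frac{1}{\sigma^2}, \qquad g_{\mu\sigma}(\mu, \sigma) = 0 \qquad\text{and}\qquad g_{\sigma\sigma}(\mu, \sigma) = \frac{2}{\sigma^2}.$$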
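The components of the Fisher information matrix of the normal distribution can be checked against the standard closed-form values $g_{\mu\mu} = 1/\sigma^2$, $g_{\mu\sigma} = 0$ and $g_{\sigma\sigma} = 2/\sigma^2$. The sketch below is a minimal numerical illustration with a hypothetical function name: it approximates the expectation value of products of the derivatives of $\ln p_{\mu,\sigma}$ with a midpoint rule on a window of twelve standard deviations around the mean.

```python
import math


def fisher_matrix(mu, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule estimate of g_ij = E[(d_i ln p)(d_j ln p)] for the normal density."""
    a = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        x = a + (k + 0.5) * h
        p = math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)
        # Derivatives of ln p with respect to mu and sigma, in that order.
        s = ((x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3)
        for i in range(2):
            for j in range(2):
                g[i][j] += s[i] * s[j] * p * h
    return g


sigma = 1.5
g = fisher_matrix(mu=0.5, sigma=sigma)
print(g[0][0], 1.0 / sigma ** 2)  # g_mu_mu against 1 / sigma^2
print(g[0][1])                    # g_mu_sigma, vanishes
print(g[1][1], 2.0 / sigma ** 2)  # g_sigma_sigma against 2 / sigma^2
```

The matrix does not depend on $\mu$, and the vanishing off-diagonal component shows that the coordinates $\mu$ and $\sigma$ are orthogonal with respect to this metric.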

In his 1925 article [38], Fisher first discussed desirable properties of estimators: functions of the data which are designed such that their values can be used to estimate the parameters of distributions describing the underlying data. He argues estimators must attain a fixed value when the sample size of the data is increased (“consistency”) and their variance multiplied by sample size must be minimal (“efficiency”). This efficiency is important as it led Fisher to the introduction of the intrinsic precision of a probability density for the purpose of estimating a parameter $\theta$. In particular, for a distribution $p_\theta$ over a set X he defines the quantity

$$g(\theta) = \int_X p_\theta(x)\left(\frac{\partial}{\partial\theta}\ln p_\theta(x)\right)^2 dx,$$

which is known as the Fisher information and of which (3.3) is the
higher-dimensional generalisation. Fisher himself describes this quantity as

the amount of information in a single observation belonging to
such a distribution,

by which he means the amount of information contained in this observation
about the parameter(s) of that distribution [38]. The importance of the
Fisher information to statistics, and to the problem of parameter estimation
in particular, is contained in the Cramér-Rao theorem [39]. Unbiased
estimators for a parameter $\theta^k$ are stochastic variables $\hat\theta^k$ for
which $\mathbb{E}_\theta[\hat\theta^k] = \theta^k$. The Cramér-Rao theorem states
that after making $N$ observations of these estimators in a population
distributed according to $p_\theta$, it holds that

$$ \left[\mathbb{E}_\theta[\hat\theta^k \hat\theta^l] - \mathbb{E}_\theta[\hat\theta^k]\,\mathbb{E}_\theta[\hat\theta^l]\right] - \frac{1}{N}\, g^{kl}(\theta) \geq 0, $$

where $g^{kl}$ denotes the components of the inverse of the Fisher information
matrix and the inequality means that the expression on the left hand side
represents the components of a positive semi-definite matrix. The Fisher
information matrix thus expresses the minimal variance these estimators may
attain. Furthermore the theorem shows there exists an estimator which attains
this lower bound on its covariance as the number of observations tends to
infinity. Such an estimator is said to be maximally efficient.
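
A small simulation can make the bound concrete. The sketch below assumes a Gaussian population with known $\sigma$ and uses the sample mean as estimator of $\mu$; since the Fisher information for $\mu$ is $1/\sigma^2$, the bound for $N$ observations is $\sigma^2/N$. The seed, sample sizes and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
n_obs, n_repeats = 50, 20_000

# Each row is one experiment of n_obs observations; the sample mean is an
# unbiased estimator of mu.
samples = rng.normal(mu, sigma, size=(n_repeats, n_obs))
estimates = samples.mean(axis=1)

empirical_variance = float(estimates.var())
cramer_rao_bound = sigma**2 / n_obs  # (1/N) times the inverse Fisher information

print(empirical_variance, cramer_rao_bound)
```

For the Gaussian mean the sample mean attains the bound for every $N$, so the two printed numbers should agree up to sampling noise.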

In the same paper [39] where he introduced the theorem which bears his
name, Rao was the first to endow the statistical manifold with the Fisher
information matrix as its Riemannian metric. This is possible as the
covariance matrix (3.3) is positive definite and its components behave as
those of a rank two tensor under a change of parameters (the general case of
which follows from combining equations (2.2) and (2.5)), that is, with
$\tilde\theta$ denoting the new parameters,

$$ g_{ij}(\theta) = \frac{\partial\tilde\theta^k}{\partial\theta^i} \frac{\partial\tilde\theta^l}{\partial\theta^j}\, \tilde g_{kl}(\tilde\theta). $$

An application of this transformation property is found in Bayesian
probability. Since the Fisher information metric transforms as a rank two
tensor, its volume form

$$ \mathrm{vol}(\theta) = \sqrt{\det g(\theta)}\; d\theta^1 \wedge \dots \wedge d\theta^n $$

is unchanged under coordinate transformations. For this reason, Jeffreys
suggested using $\sqrt{\det g}$ as a non-informative (though possibly improper)
prior on the parameter space [15].

A last property of the Fisher information matrix to be discussed here is
that it also appears in the second (and thereby lowest) order term of the
expansion of the Kullback-Leibler divergence

$$ D(p\|p_\theta) = \int_X p(x) \ln\frac{p(x)}{p_\theta(x)}\, dx, \tag{3.4} $$

which in the physical literature is better known as the relative entropy.
More precisely, for $\delta\theta$ sufficiently small it is possible to expand
this divergence function as a function of the parameters $\theta$ to obtain

$$ D_{KL}(p_\theta\|p_{\theta+\delta\theta}) = \frac{1}{2}\, g_{ij}(\theta)\, \delta\theta^i \delta\theta^j + O(\|\delta\theta\|^3). $$



The Fisher information thus expresses also the infinitesimal distance between 
nearby points on a manifold of probability distributions as expressed by the 
Kullback-Leibler divergence. This validates Rao’s choice to endow statistical 
manifolds with the Fisher information matrix as their Riemannian metric. 
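
The second-order behaviour can be illustrated for Gaussians, for which the Kullback-Leibler divergence has a well-known closed form. The sketch below assumes the parametrisation $(\mu, \sigma)$ with Fisher metric $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$; the helper names and the chosen displacement direction are illustrative.

```python
import numpy as np


def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form D(N(mu0, s0^2) || N(mu1, s1^2))."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2.0 * s1**2) - 0.5


def quadratic_form(sigma, d_mu, d_sigma):
    """(1/2) g_ij dtheta^i dtheta^j with the Gaussian Fisher metric."""
    return 0.5 * (d_mu**2 / sigma**2 + 2.0 * d_sigma**2 / sigma**2)


def relative_error(eps, mu=0.0, sigma=1.5):
    """Relative gap between the divergence and its quadratic approximation."""
    d_mu, d_sigma = eps, 2.0 * eps
    exact = kl_gauss(mu, sigma, mu + d_mu, sigma + d_sigma)
    approx = quadratic_form(sigma, d_mu, d_sigma)
    return abs(exact - approx) / approx


for eps in (1e-1, 1e-2, 1e-3):
    print(eps, relative_error(eps))
```

The relative error shrinks roughly linearly with the size of the displacement, as expected when the first neglected term is of third order.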


3.4 The affine connections 


Another important differential geometrical quantity is the affine connection 
with which a manifold can be endowed. The first attempt at investigating 
this structure for statistical manifolds was made by Rao. He studied the 
metric connection derived from the Fisher information metric and computed 
geodesic distances [SH]. However, a statistical interpretation of this metric 
connection was not immediately obvious [29]. A breakthrough in the study 
of affine connections on statistical manifolds came in the 1970s with the work 
of Chentsov, Efron and Amari. 

In his notably technical book [16] and the preceding articles, Chentsov showed 
that the space of multinomial distributions admits only a single statistically 
invariant Riemannian metric—the Fisher information metric—and a unique 
family of statistically invariant affine connections. The statistical invariance 
means that the geometric quantities defined through the metric tensor and 
the affine connections remain unchanged when the underlying probability 
space is mapped into another one through a Markov process. Chentsov’s 
work, though published in Russian in 1972, was only translated into English 
ten years later. As a consequence it remained unknown outside the Soviet 
Union for some time after its initial publication. 

The study of these affine connections outside the Soviet Union started in
1975, when Bradley Efron published the paper “Defining the Curvature of
a Statistical Problem (with Applications to Second Order Efficiency)” [17].
In this article, Efron studies one-parameter families of distributions, seen as
curves through the space of all distributions over some measurable set. He
implicitly defines a connection on this space such that one-parameter
exponential families coincide with geodesics, which are curves for which the
tangent vector field is covariantly constant along the curve. A number of
commentary texts were published together with the article. The last two of
these, written by Dawid and Reeds, are probably the most important.

Exponential families are parametrised sets of distributions over a
(measurable) set $X$ for which there exist functions $H_k : X \to \mathbb{R}$
and a function $\Phi : \mathbb{R}^n \to \mathbb{R}$ such that it is possible to
write

$$ p_\theta(x) = \exp\left(-\Phi(\theta) - \sum_{k=1}^n \theta^k H_k(x)\right) \tag{3.5} $$

with respect to some given measure [11, 17]. The parameters $\theta$ are
called the canonical parameters of the exponential family. The function
$\Phi$ is called the Massieu function [18]. Since the distribution $p_\theta$
must be normalised, the function $\Phi$ satisfies

$$ \Phi(\theta) = \ln \int_X \exp\left(-\sum_{k=1}^n \theta^k H_k(x)\right) dx. $$

In what follows it is always assumed that the values of the parameters
$\theta$ are such that $\Phi(\theta) < \infty$.

The Gaussian distributions serving as examples in this chapter belong to the
exponential family with parameters

$$ \theta^1 = \frac{1}{2\sigma^2} \qquad \text{and} \qquad \theta^2 = -\frac{\mu}{\sigma^2} $$

and Hamiltonians

$$ H_1(x) = x^2 \qquad \text{and} \qquad H_2(x) = x. $$

This can be seen by rewriting the expression for the Gaussian density in
the form (3.5).
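
The rewriting of the Gaussian density can be verified numerically. The sketch below is written under an assumed sign convention, $p_\theta(x) = \exp(-\Phi(\theta) - \theta^k H_k(x))$ with $H_1(x) = x^2$ and $H_2(x) = x$, which gives $\theta^1 = 1/(2\sigma^2)$ and $\theta^2 = -\mu/\sigma^2$; with the opposite convention the signs of the parameters flip. All function names are illustrative.

```python
import numpy as np


def canonical_parameters(mu, sigma):
    """Canonical parameters of N(mu, sigma^2) under the assumed sign convention."""
    return 1.0 / (2.0 * sigma**2), -mu / sigma**2


def massieu(theta1, theta2):
    """Phi(theta) = ln of the integral of exp(-theta1 x^2 - theta2 x) over the real line."""
    return 0.5 * np.log(np.pi / theta1) + theta2**2 / (4.0 * theta1)


def exponential_family_density(x, theta1, theta2):
    return np.exp(-massieu(theta1, theta2) - theta1 * x**2 - theta2 * x)


def gaussian_density(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


x = np.linspace(-5.0, 7.0, 25)
theta1, theta2 = canonical_parameters(mu=1.0, sigma=0.8)
print(np.max(np.abs(exponential_family_density(x, theta1, theta2)
                    - gaussian_density(x, 1.0, 0.8))))
```

The closed form used in `massieu` follows from completing the square in the Gaussian integral and requires $\theta^1 > 0$.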

The focus of Efron’s paper is on one-dimensional subsets $S$ of exponential
families, that is subsets for which it holds that

$$ S = \{\, p_\theta(x) \mid \theta = F(\rho),\ \rho \in N \subset \mathbb{R} \,\}. $$

As Reeds points out in his commentary [IH], Efron implicitly makes a number
of assumptions about the higher-dimensional exponential family. In
particular, he takes this set to be a Euclidean space, endowed with a constant
metric tensor equal to the value of the Fisher information metric at
$\theta_0$, the true value of the parameter (or parameters) that one seeks to
estimate. This assumption allows Efron to consider the one-dimensional
submanifold $S$ as a curve through Euclidean space and to compute the
extrinsic geometrical curvature thereof. This curvature is defined as the
length of the vector

$$ \left.\frac{dT}{ds}\right|_{\rho}, \tag{3.6} $$

where $T$ is the tangent of unit length to the set $S$ and $s$ is the arc length of
the curve, measured from a point which can be chosen arbitrarily. He then
defines the “statistical curvature” of $S$ to be equal in value to this geometrical
curvature. When the curvature (3.6) does not vanish, the one-dimensional
family is said to be curved in the statistical sense as well as in the geometrical
sense. Such a family is therefore called a curved exponential family.

The computation of geometrical curvature through formula (3.6) requires no
less than three separate uses of the Riemannian metric (once for the
definition of the arc length and twice to compute the length of a vector) as
well as a covariant derivative (hidden as a regular derivative with respect
to the arc length). This means Efron’s assumptions of dealing with a
Euclidean space and of the metric tensor having constant components in the
coordinate system of canonical parameters are important.

The truly interesting point of Efron’s work lies not in his definition of the
statistical curvature as a quantity itself but in the realisation that the square
of this curvature plays a crucial role in the properties of the Fisher
information metric [17]. This establishes a connection between the properties
of a statistical estimation problem and the geometry of the set of
distributions that is used to model the data. The relation of Efron’s work with
affine connections is that (non-curved) one-dimensional exponential families
coincide with geodesics through the larger space in which they are embedded.
It was Dawid who identified the connection implicitly used by Efron in the
definition of the geodesics [50]. This connection would later be known as
the “exponential connection” and is a member of the family identified by
Chentsov. Dawid also suggested the “mixture connection” in his reply and
remarked that both connections are torsionless as well as flat. It should be
stressed that these are quantitatively different connections and not different
representations of the same connection, despite what the naming similar to
that of the exponential and mixture representations of the tangent spaces
may suggest. Furthermore, Dawid briefly touches upon Chentsov’s family of
connections but he does not study their properties in his reply.

Another thing remarked by Reeds in his commentary is that the curvature
used by Efron to define statistical curvature is the extrinsic or embedding
curvature. This is a subtle but important point as it demonstrates the
necessity of considering the exponential family to be embedded in a larger,
Euclidean space. In information geometry, this space is the set of all
distributions if it is finite dimensional. In the work presented in this
dissertation, however, an analogous larger space cannot be specified in
general, even though specific examples usually do allow for it. Reeds also
initiates the work on a higher-dimensional generalisation of Efron’s work. Many
authors cite L. T. Madsen as having completed this task in her doctoral
dissertation and the subsequent 1979 paper [51]. Unfortunately, this article
seems to be very difficult to obtain, perhaps since it has not appeared in a
peer reviewed journal but rather as a research report for the Danish Medical
Society. For this reason, it is difficult to tell exactly which of the advances
were made by Dr. Madsen.

The work of Amari advanced that of Chentsov, Efron, Dawid and Madsen by
introducing a differential geometric framework for the construction of
higher-order asymptotic theory of statistical inference (see for instance
[30, 52, 53]), in which the family of connections introduced by Chentsov plays
an important role.

These affine connections are usually presented in a relatively technical way
but they have very simple interpretations. This was pointed out by Dawid
already in his reply to Efron’s paper [50] but this piece of knowledge is,
unfortunately, often ignored in the rest of the introductory literature on the
subject. Remember that affine connections serve to define covariant
derivatives and thereby parallel transport. In the exponential representation,
the elements of a tangent space $T_\theta M$ are stochastic variables $V$ such
that

$$ \mathbb{E}_\theta[V] = \int_X p_\theta(x)\, V(x)\, dx = 0. $$

However, there is no guarantee that the expectation value of $V \in T_\theta M$
will also vanish when evaluated at another distribution, that is
$\mathbb{E}_\zeta[V] = 0$ cannot be guaranteed when $\zeta \neq \theta$.
Consequently, parallel transport need not preserve the vanishing of the above
expectation value. The family of affine connections of information geometry
solves this problem by defining parallel transport in such a way that the
expectation value of a stochastic variable remains zero when it undergoes
parallel transport. In particular, the exponential connection ($\alpha = 1$)
defines the operation $\Pi^{(1)} : T_\theta M \to T_\zeta M$ by explicitly
subtracting the expectation value in the end point:

$$ \Pi^{(1)} V = V - \mathbb{E}_\zeta[V]. $$

The mixture connection ($\alpha = -1$) on the other hand multiplies the
statistic with the appropriate Radon-Nikodym derivative (see for instance [16]
or an introductory work on the matter), that is
$\Pi^{(-1)} : T_\theta M \to T_\zeta M$ works as

$$ \Pi^{(-1)} V = V\, \frac{p_\theta}{p_\zeta}. $$

This means the expectation value equals

$$ \mathbb{E}_\zeta[\Pi^{(-1)} V] = \int_X p_\zeta(x)\, V(x)\, \frac{p_\theta(x)}{p_\zeta(x)}\, dx = \mathbb{E}_\theta[V] = 0. $$

The other affine connections of the $\alpha$-family are simply linear
combinations of the exponential and mixture connections, in particular

$$ \nabla^{(\alpha)} = \frac{1 + \alpha}{2}\, \nabla^{(1)} + \frac{1 - \alpha}{2}\, \nabla^{(-1)}. $$

A similar interpretation exists for the other representations of the tangent
space but this requires the introduction of the notion of escort probability
distributions [5D]. This is closely related to the work of Tsallis and an
extensive introduction of escort probabilities can be found in [1] but to treat
this explicitly would fall outside the scope of this introduction.

Even though the exponential and mixture connections give rise to a
path-independent definition of parallel transport and are thus flat, the rest
of the $\alpha$-family has a constant but non-vanishing scalar curvature (also
known as the Ricci scalar of the connection), which is given by [29]



The metric connection associated with the Fisher information metric is
obtained in the case $\alpha = 0$ and so it is the most (positively) curved
member of this family. As is elucidated in the previous chapter, the metric
connection is the unique torsionless affine connection satisfying

$$ X(g(Y, Z)) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z). \tag{2.15 revisited} $$

This property only holds when $\alpha = 0$ and in general it can be shown that
for any $\alpha \in \mathbb{R}$

$$ X(g(Y, Z)) = g(\nabla^{(\alpha)}_X Y, Z) + g(Y, \nabla^{(-\alpha)}_X Z), \tag{3.7} $$

where $g$ represents the Fisher information metric. This shows that the
$\alpha$- and $(-\alpha)$-connections are dual with respect to this metric. Dual
connections play an important role in the more advanced topics of information
geometry. One application is found in a generalisation of the Pythagorean
property, which holds in a triangle where one leg is a geodesic for the
exponential connection and the other leg is a geodesic for the mixture
connection [30].

3.5 Divergence functions 

Due to the importance in information geometry of the relative entropy (3.4),
a central role in this dissertation shall be played by divergence functions, or 
as they are sometimes also called, contrast functions. They shall serve as the 
elementary structure from which all geometric objects shall be constructed, 
just as all the Riemannian geometry of statistical manifolds elucidated above 
can be derived from the divergence function of Kullback and Leibler. 

The Kullback-Leibler distance is a function quantifying in a certain sense the 
difference between statistical distributions over a measurable set X. As was 
mentioned earlier in this chapter, it is given by the expression 

$$ D(q\|p) = \int_X q(x) \ln\frac{q(x)}{p(x)}\, dx. $$

It should be noted that this integral is always defined, even though its value
may be infinite, as the measures from which $p$ and $q$ are derived are finite [5].
The relative entropy plays an important role not only in statistics but also
in information theory. In fact, it was introduced by Kullback and Leibler as
an abstraction of Shannon’s entropy

$$ S(p) = -\int_X p(x) \ln p(x)\, dx, \tag{3.8} $$

which in turn was introduced for purposes of information theory in [5l], even
though the expression also appears in the work of Boltzmann and Gibbs.
The quantity

$$ \ln\frac{q(x)}{p(x)} $$

is a measure for the information obtained in the result of a measurement $x$
for discrimination between the hypotheses “$x$ is distributed according to the
distribution $q$” and “$x$ is distributed according to the distribution $p$”. The
Kullback-Leibler divergence is therefore equal to the mean information for
discrimination [5] and it is in this sense that it can be said to quantify the
difference between its arguments.

A commonly used set of contrast functions are the $f$-divergences of Csiszár
[55]. They are of the form

$$ D_f(q\|p) = \int_X p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx, $$

where $f$ is a convex function for which $f(1) = 0$. The Csiszár $f$-divergences
are the largest class of statistically invariant divergence functions. Another
often made choice is the set of Bregman divergences [SB]. The most general
definition of these divergences is not limited to statistical distributions but
when restricted thereto, it is possible to write them in the form

$$ D_F(q\|p) = \int_X \int_{p(x)}^{q(x)} \left[F'(u) - F'(p(x))\right] du\, dx, $$

where $F$ is a strictly convex function. These are also known as
$U$-divergences, after their definition by Eguchi [57]. It can be shown that the
Kullback-Leibler divergence is the only contrast function which belongs to both
the classes of Csiszár and Bregman divergences [29].
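
That the Kullback-Leibler divergence sits in both classes can be illustrated on a finite sample space, where integrals over $X$ become sums. The sketch below assumes the generators $f(u) = u \ln u$ for the Csiszár form and $F(u) = u \ln u - u$ for the Bregman form, and uses the closed form $F(q) - F(p) - F'(p)(q - p)$ of the inner integral; the function names and the two example distributions are illustrative.

```python
import numpy as np


def kullback_leibler(q, p):
    return float(np.sum(q * np.log(q / p)))


def csiszar_divergence(q, p, f):
    """Sum over the sample space of p * f(q / p)."""
    return float(np.sum(p * f(q / p)))


def bregman_divergence(q, p, F, dF):
    """Sum of F(q) - F(p) - F'(p) (q - p), the inner integral in closed form."""
    return float(np.sum(F(q) - F(p) - dF(p) * (q - p)))


q = np.array([0.2, 0.5, 0.3])
p = np.array([0.4, 0.4, 0.2])

kl = kullback_leibler(q, p)
as_csiszar = csiszar_divergence(q, p, lambda u: u * np.log(u))
as_bregman = bregman_divergence(q, p, lambda u: u * np.log(u) - u, np.log)

print(kl, as_csiszar, as_bregman)
```

The three numbers coincide because both $q$ and $p$ are normalised; for unnormalised positive measures the Bregman form picks up the extra term $\sum (p - q)$.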

In order to gain more insight into general divergence functions, it is
interesting to compare them to the more familiar metrics (not to be confused
with metric tensors). A metric on a set $S$ is a function

$$ d : S \times S \to \mathbb{R} $$

which is zero everywhere on the diagonal of $S \times S$ and strictly positive
elsewhere, is symmetric ($d(x, y) = d(y, x)$) and satisfies the triangle
inequality

$$ d(x, y) + d(y, z) \geq d(x, z). $$

Divergence functions on $S$ on the other hand are only required to satisfy the
first condition. An example of a symmetric divergence is the squared
Euclidean distance in $\mathbb{R}^n$, which satisfies the cosine rule instead
of the triangle inequality:

$$ d(x, y)^2 + d(y, z)^2 = d(x, z)^2 + 2\, d(x, y)\, d(y, z) \cos(\theta), $$

with $\theta$ the angle between the legs of the triangle meeting at the point $y$.

Once an appropriate divergence function $D$ over a manifold $M$ has been
chosen, it can be used to define a differential geometric structure upon $M$,
see for example [58, 59]. In order for this to be possible, the divergence
must be sufficiently many times continuously differentiable with respect to
the coordinates of its arguments, at least in a neighbourhood of the
diagonal of $M \times M$. A lot of interesting work in this context is due
to Eguchi and collaborators [571 EDI EU but also to Amari and collaborators
[52, 53, 59]. In their research, they have investigated the geometric
structure of statistical manifolds as well as more general manifolds endowed
with divergence functions and applications thereof.

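The first geometric structure to be introduced is, as usual, the metric tensor.
Since divergence functions over a manifold $M$ are zero everywhere on the
diagonal of $M \times M$ and only there, it must automatically hold that the
lowest order term in their Taylor expansion is the second order term, that is

$$ D(\theta\|\theta + \delta\theta) = \frac{1}{2}\, g_{ij}(\theta)\, \delta\theta^i \delta\theta^j + O(\|\delta\theta\|^3). $$

The coefficient of this lowest order term (without the factor $\frac{1}{2}$) is
the metric tensor induced by the divergence:

$$ g_{ij}(\theta) = \left.\frac{\partial^2}{\partial\zeta^i\, \partial\zeta^j}\, D(\theta\|\zeta)\right|_{\zeta=\theta}. \tag{3.9} $$

This is a positive definite quantity as it is the matrix of second derivatives
of a function in a local minimum. Despite this, it behaves properly under a
coordinate transformation, as can be seen also from the alternative expression

$$ g_{ij}(\theta) = -\left.\frac{\partial}{\partial\theta^i} \frac{\partial}{\partial\zeta^j}\, D(\theta\|\zeta)\right|_{\zeta=\theta}, $$

which can be shown to hold since the first derivatives of the divergence vanish
identically on the diagonal.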
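
The construction of a metric from a divergence can be tried out with finite differences. The sketch below assumes the divergence is the closed-form Kullback-Leibler divergence between Gaussians in the coordinates $(\mu, \sigma)$, and evaluates both the second derivative in the second argument and minus the mixed derivative on the diagonal; function names and the step size `h` are illustrative choices.

```python
import numpy as np


def kl_gauss(theta, zeta):
    """Closed-form D(p_theta || p_zeta) for Gaussians with theta = (mu, sigma)."""
    mu0, s0 = theta
    mu1, s1 = zeta
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2.0 * s1**2) - 0.5


def _step(n, i, h):
    e = np.zeros(n)
    e[i] = h
    return e


def metric_second_argument(D, theta, h=1e-4):
    """Second derivatives of D(theta || zeta) in zeta, evaluated at zeta = theta."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    g = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = _step(n, i, h), _step(n, j, h)
            g[i, j] = (D(theta, theta + ei + ej) - D(theta, theta + ei - ej)
                       - D(theta, theta - ei + ej) + D(theta, theta - ei - ej)) / (4 * h * h)
    return g


def metric_mixed(D, theta, h=1e-4):
    """Minus the mixed derivative d/dtheta^i d/dzeta^j of D on the diagonal."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    g = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = _step(n, i, h), _step(n, j, h)
            g[i, j] = -(D(theta + ei, theta + ej) - D(theta + ei, theta - ej)
                        - D(theta - ei, theta + ej) + D(theta - ei, theta - ej)) / (4 * h * h)
    return g


theta = (0.5, 1.3)
print(metric_second_argument(kl_gauss, theta))
print(metric_mixed(kl_gauss, theta))
```

Both matrices should approximate the Gaussian Fisher information matrix $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, which is the sense in which the Kullback-Leibler divergence induces the Fisher metric.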

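Affine connections can be constructed from divergence functions just as well.
In particular, it is possible to consider the pair of mutually dual connections
$\nabla$ and $\nabla^*$ defined through the expressions [5^1^

$$ \Gamma_{ij,k}(\theta) = -\left.\frac{\partial^2}{\partial\theta^i\, \partial\theta^j} \frac{\partial}{\partial\zeta^k}\, D(\theta\|\zeta)\right|_{\zeta=\theta} $$

and

$$ \Gamma^*_{ij,k}(\theta) = -\left.\frac{\partial^2}{\partial\zeta^i\, \partial\zeta^j} \frac{\partial}{\partial\theta^k}\, D(\theta\|\zeta)\right|_{\zeta=\theta}. \tag{3.10} $$

For more background regarding the geometry induced by divergence
functions, the reader is referred to the work of Amari and Eguchi cited in the
text above.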


3.6 Applications in thermodynamics 

The material introduced so far in this chapter is almost exclusively based 
upon the mathematical literature. However, it deserves mentioning that 
some of these geometrical aspects have also been adopted by the community 
of researchers in thermodynamics. Since the necessary background for the 
following chapters has already been introduced, this overview shall be kept 
fairly short. Nevertheless, the historical development of this particular field
of research shows some interesting parallels with the goal of this dissertation.
Indeed, both are instances where geometry is applied in an attempt to
unearth the mathematical foundations of a physical theory of which the basics
are less rigorous.

The first connection between differential geometry and thermodynamics was
made by Constantin Carathéodory, who sought to establish an axiomatic
basis for thermodynamics. He chose to express this in terms of differential
geometry [62, 63]. In these papers he could phrase thermodynamics on sound
mathematical principles, rather than on the more usual references to
imaginary devices such as Carnot engines or to concepts such as the flow of
heat.¹ For this, he works on the topological manifold of thermodynamic states
of the system. His rendering of the Second Law states [251IM]

In every neighbourhood of every equilibrium state x, there are 
states y that are not accessible from x via quasi-static adiabatic 
paths. 

This formulation is weaker than Kelvin’s better known one, which states that
no cyclical process can exist which turns heat into its mechanical equivalent
of work. Starting from this axiom, Carathéodory could derive thermodynamics
as it was known in his time. His results therefore extend those of
Helmholtz, who had already noticed that a definition of temperature or
entropy does not require cyclical processes or ideal gases [6^.

The next step in building a geometric basis for thermodynamics and
statistical mechanics comes from Tisza [55] and Griffiths and Wheeler [55],
whose contributions may be deemed important more for steering geometric
thermodynamics away from the difficult formalism of Carathéodory than for
their actual results. A few years later, Weinhold published a series of
papers regarding the metric geometry of equilibrium thermodynamics [67]. His
main contribution, at least in the context of this discussion, is to endow the
equation of state surface appearing in the Gibbs formulation of equilibrium
thermodynamics with a Riemannian metric. This is a completely different
approach from that of Carathéodory, which is based primarily on differential
1-forms over the set of states. Weinhold’s metric has components given by

$$ g^{(W)}_{ij} = \frac{\partial^2 U}{\partial N^i\, \partial N^j}, $$

where $U$ is the internal energy and the set of $N^i$ contains the independent
conserved extensive quantities of the system (such as volume and particle
numbers) and the entropy. This is a matrix containing quantities related to
standard thermodynamic linear response functions such as compressibility
and specific heat. Due to the convexity of the internal energy in single phase
regions, the metric $g^{(W)}$ is positive definite. Application of the
Cauchy-Schwarz and Bessel inequalities allowed Weinhold to derive many of the
standard thermodynamical inequalities. He did not, however, compute distances
between points of the equilibrium surface |B5].

¹ It is interesting to note that Carathéodory introduces heat as a derived
rather than as a fundamental quantity; the latter is the approach of
conventional thermodynamics.

Four years later, Ruppeiner published an interpretation of the metric
structure and introduced an intrinsic rather than an extrinsic geometry [69]. He
does this not by endowing the equilibrium surface with a metric but rather by
introducing the metric on an abstract manifold of equilibrium states. Apart
from this difference, his metric tensor $g^{(R)}$ is related to the one
introduced by Weinhold, $g^{(W)}$, through a conformal transformation:

$$ g^{(R)} = \frac{1}{T}\, g^{(W)}. $$

The interpretation Ruppeiner gives to this metric tensor is very simple. It 
expresses the distance between neighbouring states in the sense that the 
more likely a fluctuation bringing one equilibrium state into another is to 
occur, the closer they are in the Ruppeiner metric. A related result is that 
the geodesic distance is related to the diffusion of the system through state 
space by fluctuations. 

Furthermore, Ruppeiner argues that curvature exhibited by the metric
connection is due to interactions in the fluid, as the interactionless ideal fluid
gives rise to a geometry exhibiting no curvature. He goes on to expand upon
this idea by arguing that the curvature is proportional to the cube of the
correlation length of the system. He does this through a line of reasoning
based on dimensional analysis and scaling relations. This leads to the
identification of universal constants, a result supported by experimental
evidence (see ini] for the details).

The metric tensors introduced by Weinhold and Ruppeiner give rise to a
measure of distance sometimes known under the name of thermodynamic
length (see for instance HDl and references therein). Nevertheless, there is
also criticism of the view that it is possible to see this metric tensor as
suitable to define a meaningful distance between points on the
equation-of-state surface. This was elucidated by Gilmore HH. He showed that
using the positive definite quantity introduced by Weinhold (which is the
second fundamental form of this surface) as the metric tensor leads to
constraints on the third derivatives of the internal energy of a system. Such
constraints, however, are nowhere to be found in thermodynamics.

Another interesting result obtained by the application of geometrical
considerations is found in other work by Gilmore [72]. Basing himself on
fluctuation theory just like Ruppeiner and applying the Cramér-Rao bound
(which is nothing but a consequence of the Cauchy-Schwarz inequality from a
geometrical point of view), he was able to obtain uncertainty relations for
thermodynamical quantities in a system undergoing fluctuations. More
precisely, if a system is in equilibrium with a reservoir with well-defined
intensive variables, then variations in the measured values of the system’s
extensive variables will lead to variations in the estimation of the
reservoir’s corresponding intensive variables. Gilmore’s result states that the
product of the variances of the distributions of these quantities must be
larger than a constant, for example

$$ \Delta U\, \Delta\!\left(\frac{1}{T}\right) \geq k_B, $$

where the right hand side here represents Boltzmann’s constant. While this
particular example was already known to Gibbs ua, Gilmore’s line of
reasoning can be applied to any conjugate pair of intensive and extensive
variables. Furthermore, it is possible to show that these relations are
equivalent to the stability criteria of equilibrium thermodynamics (see again
[72] for the details).



4. THE DATA SET MODEL FORMALISM


This is the central chapter of this dissertation. In it, the development of
the data set model formalism is presented. First, the different elements which
must be present in order for the formalism to be applicable are outlined and 
discussed in detail. Afterwards, the geometry of the data set models will be 
constructed and the general consequences investigated. Concrete examples 
and applications are examined in the next chapter. 

4.1 The elements of the formalism

4.1.1 The data sets 

Consider a set X of mathematical objects, called data sets, which one would 
like to model by representing them by the elements of a parametrised set. In 
principle X can be any set but the work set out herein will assume X to be 
endowed with a topology. This is truly an assumption and not a demand; 
the formalism can work without such structure. However, it is expected that 
when there is a topology on the set X, as is often the case, one will desire 
that this topology is respected by the modelling process. This will be
accommodated by the data set model formalism through for instance the demand
that the mapping from data to model point is continuous. In order to make
the discussion of this aspect meaningful, it is thus assumed that X has indeed 
been endowed with a topology. 

In statistical physics, the data sets are elements of the simplex of
distributions over a measurable set; they represent the empirical data obtained
from an experiment [1]. Another often encountered example is a collection of
measurements for which a functional relation is to be found, as is for example
the case in linear regression. In a quantum mechanical setting, the data to
be modelled could be data obtained through measurements performed on an
ensemble of states. An example from machine learning could be a fingerprint
acquired at the scene of a crime, of which the general class has already
been determined but which still needs to be characterised to be stored
efficiently in a database through which to search in future investigations.

4.1.2 The model points

The data sets in $X$ are intended to be modelled by an element $m$ of a
parametrised set, which will be called a model point. In order for the methods
set out in this dissertation to be applicable, this parametrisation must be a
homeomorphism. As such, the parameters can also play the role of coordinates
and the model points will constitute a topological manifold. The symbol $M$
will henceforth be used to denote the manifold of models, and no longer an
arbitrary manifold. When the parameters are represented by $\theta$, the points
of the manifold will often be denoted as $m_\theta$, in analogy with the
parametrised distributions $p_\theta$. As is the case in information geometry,
it is important to keep in mind that the number of parameters must be finite
in order to practise differential geometry with the particular mathematical
techniques used here. This finite number, which by definition is also the
dimension of the manifold $M$, is represented by the letter $n$ in this
chapter. Examples of infinite-dimensional models would be the Hilbert space of
some quantum systems such as the harmonic oscillator, or models parametrised
by response functions.

Perhaps the most common example of such a model is the family of first
order polynomials

$$ \{\, f \mid f(x) = ax + b,\ (a, b) \in \mathbb{R}^2 \,\} $$

used to fit data points in a plane when practising linear regression. In a
quantum setting, parametrised subsets of the Hilbert space provide an
obvious choice for the models of the formalism. Earlier published work in this
context considered the coherent states of the (one-dimensional) quantum
harmonic oscillator, which is a two-dimensional family of the most classical
states of that quantum system m-. In the above example of the fingerprint,
once the general class has been determined, the exact ridge pattern could be
described by a limited number of continuously varying parameters indicating
the positions of characteristic points of the pattern.

4.1.3 The model map 

The actual process of modelling, determining which model point is chosen to
represent a given data set, happens by means of the model map $\mu$. This map
is required to be continuous (when a topology on $X$ has been chosen) as a
small change in the data set should never cause a large change in the model
chosen to represent it.
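
For the linear regression example, a model map can be sketched in a few lines. The code below is an illustration under stated assumptions, not the construction used later in this dissertation: data sets are taken to be finite arrays of points in the plane, the model map is assumed to be the ordinary least-squares fit, and the mean squared residual plays the role of a hypothetical cost of describing the data by a line. The names `cost` and `model_map` are invented for this sketch.

```python
import numpy as np


def cost(data, model):
    """Hypothetical cost of describing the data by the line (a, b): mean squared residual."""
    xs, ys = data
    a, b = model
    return float(np.mean((ys - (a * xs + b)) ** 2))


def model_map(data):
    """Illustrative model map: the least-squares line through the data points."""
    xs, ys = data
    design = np.column_stack([xs, np.ones_like(xs)])
    coefficients, *_ = np.linalg.lstsq(design, ys, rcond=None)
    return coefficients  # array holding (a, b)


xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])
model = model_map((xs, ys))

# Continuity: a small perturbation of the data moves the model point only slightly.
perturbed = model_map((xs, ys + 1e-6 * np.array([1.0, -1.0, 1.0, -1.0])))
shift = float(np.max(np.abs(perturbed - model)))

print(model, shift)
```

Because the least-squares solution depends linearly on the observed values, the shift of the model point is proportional to the size of the perturbation, which is the kind of continuity demanded of the model map.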

Furthermore, the model map and the divergence function must be
compatible. This condition will be elucidated in the upcoming discussion of the
divergence. A consequence of this is, however, that the domain of $\mu$ may be
limited to a subset of $X$. Therefore it is more correct to identify $X$ as the
domain of $\mu$ and to take into account the possibility that this domain is in
fact a subset of a larger set. For the sake of succinctness this remark will not
be reiterated explicitly in those instances where the distinction is clear from
context. A similar remark holds for the image of the model map; the
geometric quantities that are introduced can only be defined on the
(topological) closure of the image of $\mu$. In order for the data set model to
be applicable, this closure must thus be a manifold in its own right.

The model map can be assumed not to be injective, as injectivity would make
the process of modelling redundant. The notation $\mu^{-1}(m)$ will therefore
be used to indicate the collection of those data sets $x$ for which
$\mu(x) = m$. These subsets of $X$ upon which $\mu$ attains a constant value
will be called fibres. For reasons of convenience the metaphor of fibre bundles
is also used in saying a model point $m$ is the projection of a data set $x$,
by which again it is meant that $\mu(x) = m$. It should be noted that this
nomenclature is suggestive rather than rigorous. The set $X$ can be likened to
the total space of a fibre bundle, the manifold $M$ to the base space and the
model map $\mu$ to the projection but together these objects cannot be properly
called a fibre bundle. The definition of a proper fibre bundle requires the
fibres to be copies of the same space [2S1I3Z] and this is not generally the
case for the sets $\mu^{-1}(m)$.
In certain contexts, such as a physical experiment, one may not have access 
to the data sets itself but only to a limited list of observable quantities. When 
this occurs, the data sets may acquire extra structure. In the simplest case 
the data sets can only be described by a hnite number of real numbers, in 
which case X will adopt a manifold structure. Also the divergence function 
may induce more properties upon the set X. Though this may give rise to 
specihc consequences, these were not investigated explicitly and hence this 
dissertation does not report upon results concerning such additional struc¬ 
ture. 


4.1.4 The divergence 

A crucial element in the modelling process considered is a divergence 
function, which is consistently denoted as $D$. The value $D(x\|m)$ is a 
measure of how well the model point $m \in \mathbb{M}$ describes the data 
set $x \in \mathbb{X}$, where a smaller value indicates a better match. As 
such, it can be thought of as a “badness of fit”. An alternative viewpoint 
is that $D(x\|m)$ represents the cost of describing $x$ by means of $m$. 
This perspective is treated in more detail in, for instance, the work of 
Topsøe [ZSHIZ], but this idea is not actively entertained in this 
dissertation. 

The divergence function must satisfy four conditions, most of which are in 
some way related to the properties of divergence functions as they were 
treated in the previous chapter. These conditions are: 

1. The domain of $D$ contains $\mathrm{Dom}(\mu) \times \mathrm{Im}(\mu)$. 
In many practical problems the divergence will imply the model map and its 
properties, making this condition largely trivial. 

2. The divergence is sufficiently many times continuously differentiable 
with respect to the parameters of its second argument. This demand 
must hold for each of the different coordinate systems used. 

3. Given any $x \in \mathrm{Dom}(\mu)$, the function 

$$m \mapsto D(x\|m)$$ 

has a local minimum at $m = \mu(x)$. This is the condition of compatibility 
of the model map and the divergence that was mentioned earlier. 
That $\mu(x)$ provides a unique global minimum is only strictly required 
for the generalised Pythagorean theorem. Nevertheless, this will be 
assumed to hold as well. 

4. For all $m_\theta \in \mathrm{Im}(\mu)$, the function 

$$x \mapsto \partial_i\partial_j D(x\|m_\theta)$$ 

has a constant value on the fibre of $m_\theta$. The drawings illustrate 
this schematically for two divergence functions on a 1-dimensional 
$\mathbb{M}$. For both drawings, two data points $x$ and $y$ within the same 
fibre are chosen. In the leftmost drawing, the curvature of both graphs in 
the minimum is the same, though this is not the case elsewhere, and the 
condition is satisfied. In the rightmost drawing, the curvature is not 
the same in the minimum even though the graphs may seem more 
similar (they are both parabolas) and the condition is not satisfied. 


[Figure: two schematic drawings of the graphs of $D(x\|m_\theta)$ (solid 
line) and $D(y\|m_\theta)$ (dashed line) as functions of $\theta$.] 

From the assumption of a global minimum it follows that the divergence is 
bounded from below, at least on $\mathrm{Dom}(\mu) \times \mathrm{Im}(\mu)$. 
Without loss of generality, this function can then be assumed to attain 
positive values only. In previous work [7^178] a stronger condition was 
imposed but this turns out to be largely unnecessary. Furthermore, it is 
often convenient to impose extra conditions, depending on the presence of a 
topology for $\mathbb{X}$. 


5. When a topology for $\mathbb{X}$ has been specified, it will be assumed 
that the divergence is also continuous in its first argument. (The 
continuity in the second argument is already implied by Condition 2.) 

6. Given any $x \in \mathrm{Dom}(\mu)$, there exists a neighbourhood 
$\mathcal{N}$ of $x$ wherein there exist data sets $x^{(l)}$ such that the 
numbers 

$$\partial_k D(x^{(l)}\|\mu(x))$$ 

make up the components of an invertible matrix. 

A stronger version of this condition is that in this neighbourhood, the 
function $y \mapsto \partial_k D(y\|\mu(x))\,d\theta^k$ is continuous and 
has an image homeomorphic to a subset of $\mathbb{R}^n$. Then there exist 
functions $X^l$ mapping a neighbourhood of $0$ ($\in \mathbb{R}$) to 
$\mathcal{N}$ such that 

$$\left.\frac{d}{d\varepsilon}\,\partial_k D(X^l(\varepsilon)\|\mu(x))\right|_{\varepsilon=0} = \delta^l_k.$$ 

This is a construction similar to one used in the study of gradient 
flows on metric spaces; see for instance na for an introduction to 
that discipline. 


At some points in this dissertation, it will be necessary to distinguish between 
the definition of a divergence function given here and the more restrictive 
definition that is presented in the introduction. To make the distinction 
where confusion may arise, the latter kind will sometimes be referred to as a 
“proper divergence (function)”. 

Together the sets and maps $\mathbb{X}$, $\mathbb{M}$, $\mu$ and $D$ make up 
what will be called the data set model $(\mathbb{X}, \mathbb{M}, \mu, D)$. 
In what follows, the geometry of these models will be studied and it will be 
elucidated how this geometry gives rise to a method to easily determine 
whether or not a given parametrised family of distributions belongs to an 
exponential family. 

4.2 The geometry of data set models 

4.2.1 Topology 

For the sake of completeness, it is worthwhile to consider for a moment the 
topologies of the set $\mathbb{X}$ and the manifold $\mathbb{M}$. Many, if 
not all, important functions in this dissertation have these two sets (or at 
least subsets thereof) as either their domain or as their co-domain and many 
of these functions can be demanded to be continuous. Examples thereof 
include the model map ($\mathbb{X} \to \mathbb{M}$), the divergence function 
($\mathbb{X} \times \mathbb{M} \to \mathbb{R}$) and the coordinate functions 
($\mathbb{M} \to \mathbb{R}^n$). 

That $\mathbb{M}$ is a manifold is actually a property of its topology. In 
particular, the local homeomorphism of $\mathbb{M}$ with $\mathbb{R}^n$ 
implies that neighbourhoods of model points in $\mathbb{M}$ are, from a 
topological point of view, Tychonoff spaces [51]. 


4.2.2 The Riemannian metric 

In order to induce a geometric structure on the manifold of models, a metric 
tensor must be constructed. This can be done in analogy with the Fisher 
information metric and the metric tensor derived from divergence functions. 
In those examples, the components of the metric tensor are second derivatives 
of the divergence. The geometric interpretation of this is a generalisation of 
Fisher’s ideas as they were discussed in the previous chapter. 

Fix a data set $x$ and consider the surface defined by the graph of the map 

$$\theta \mapsto D(x\|m_\theta)$$ 

on a neighbourhood of the coordinates of $\mu(x)$. In order to obtain a good 
measure of information in Fisher’s sense, it is required to look at how 
strongly peaked the graph is around its minimum, as this encodes the 
sensitivity of the divergence for small variations in the chosen model 
point. This sharpness is nothing else than the extrinsic or embedding 
curvature of the surface in $\mathbb{R}^{n+1}$ and this can be quantified 
using the theory of surface geometry. In particular, it is encoded in the 
second fundamental form and it can be shown that in the minimum, the 
curvature is quantified by the values 

$$\partial_i\partial_j D(x\|\mu(x)).$$ 

The logical choice for the metric tensor would therefore be an expression of 
the form 

$$g_{ij}(\theta) = \partial_i\partial_j D(x\|m_\theta) \quad\text{with}\quad \mu(x) = m_\theta. \tag{4.1}$$ 

This expression is only well-defined, however, if the divergence satisfies 
Condition 4 set out earlier in this chapter. The tensor (4.1) will be called 
the generalised Fisher information metric. Although it is defined as a 
second derivative, which in general does not behave under coordinate 
transformations in the way a metric tensor ought to, no problems arise with 
this definition. 

Theorem 1. The definition (4.1) for the metric tensor is invariant under 
coordinate transformations. 

Proof. Choose $x$ in the fibre of $m_\theta$ and a transformation from the 
coordinates $\{\theta^i\}$ to the coordinates 
$\{\zeta^a\} = \{\zeta^a(\theta)\}$. Then the components of the metric 
tensor transform as 

$$\begin{aligned}
g_{ij}(\theta) &= \partial_i\partial_j D(x\|m_\theta) \\
&= \frac{\partial\zeta^a}{\partial\theta^i}\frac{\partial\zeta^b}{\partial\theta^j}\,\partial_a\partial_b D(x\|m_\zeta)
 + \frac{\partial^2\zeta^a}{\partial\theta^i\,\partial\theta^j}\,\partial_a D(x\|m_\zeta) \\
&= \frac{\partial\zeta^a}{\partial\theta^i}\frac{\partial\zeta^b}{\partial\theta^j}\,g_{ab}(\zeta),
\end{aligned}$$ 

where the term containing the first derivative of the divergence vanishes 
because the divergence has a minimum at $\mu(x)$ (Condition 3). This 
completes the proof. □ 


The expression (4.1) achieves this coordinate invariance without changing 
the second derivative into a Hessian, which would require a connection to be 
chosen. Since such a choice would necessarily be arbitrary at this point, it is 
a benefit of this approach that this scenario can be avoided. As such, a 
metric can be defined under relatively weak conditions; it will be shown 
shortly that the introduction of the affine connection places a stronger 
demand on the divergence. 

The Kullback-Leibler divergence satisfies Condition 4 when $\mathbb{M}$ is 
an exponential family of probability distributions since, with 
$\{\theta^i\}$ the canonical parameters, 

$$\partial_i\partial_j D(p\|p_\theta)
 = -\frac{\partial}{\partial\theta^i}\int p(x)\,\frac{\partial}{\partial\theta^j}\ln p_\theta(x)\,dx
 = \partial_i\partial_j\Phi(\theta).$$ 

This is obviously a strong way to satisfy this condition, as the derivatives 
of the Massieu function $\Phi$ are completely independent of the 
distribution $p$. However, it is not necessary for the existence of a 
Riemannian geometry that such a strong condition is imposed. An example of a 
data set model outside the field of information geometry allowing for such a 
metric tensor is that of the grand canonical ensemble of bosonic particles, 
treated in detail in the next chapter. The Gumbel distributions discussed 
shortly thereafter are an example of a family of distributions which do not 
satisfy Condition 4 when the same Kullback-Leibler divergence is used. 
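The independence of the second derivative from the distribution $p$ can be 
checked numerically. The following is a minimal sketch under stated 
assumptions, not part of the formalism itself: the three-point sample space, 
the single Hamiltonian $H(x) = x$ and all function names are hypothetical 
choices made for illustration, and derivatives are estimated by central 
finite differences. 

```python
import math

# Assumed toy model: sample space {0, 1, 2}, one Hamiltonian H(x) = x,
# canonical form p_theta(x) = exp(-Phi(theta) - theta * H(x)).
H = [0.0, 1.0, 2.0]

def massieu(theta):
    """Phi(theta) = ln sum_x exp(-theta * H(x)), which normalises p_theta."""
    return math.log(sum(math.exp(-theta * h) for h in H))

def model(theta):
    """Probabilities p_theta(x) of the assumed exponential family."""
    phi = massieu(theta)
    return [math.exp(-phi - theta * h) for h in H]

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) on a finite sample space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def second_derivative(f, t, h=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

def kl_hessian(p, theta):
    """Estimate of the second theta-derivative of D(p||p_theta)."""
    return second_derivative(lambda t: kl(p, model(t)), theta)

theta0 = 0.3
p1 = [0.2, 0.5, 0.3]  # two arbitrary, unrelated distributions
p2 = [0.7, 0.1, 0.2]
hess1 = kl_hessian(p1, theta0)
hess2 = kl_hessian(p2, theta0)
hess_phi = second_derivative(massieu, theta0)
# The three values agree up to the finite-difference error.
print(hess1, hess2, hess_phi)
```

Replacing the canonical family by a family that is not exponential would in 
general make the first two numbers differ, which is the situation described 
above for the Gumbel distributions. 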

4.2.3 On the properties of exponential families 

After a metric tensor has been constructed for a data set model, the next 
geometric structure to be treated is the affine connection. In order to guide 
the development of the data set model formalism, it is useful to first discuss 
particular properties of exponential families. 

As is mentioned in the introduction, Efron introduced the statistical 
curvature of one-dimensional submanifolds of exponential families. This was 
generalised to higher dimensions, first by Reeds and then by Madsen and 
Amari. A recurring property in their work is that the manifold of 
distributions belonging to an exponential family is flat when it is endowed 
with the exponential connection. (Amari has also shown similar results for 
generalised families, see [22].) This is an important element of the 
argument for the choice of affine connection for data set models. 

Exponential families are parametrised sets of distributions that can be 
written in the canonical form 

$$p_\theta(x) = \exp\{-\Phi(\theta) - \theta^k H_k(x)\}. \tag{4.2}$$ 

This means that for any distribution $p$, the second derivative of the 
Kullback-Leibler divergence evaluated in the pair $(p, p_\theta)$ satisfies 

$$\partial_i\partial_j D(p\|p_\theta) = \partial_i\partial_j\Phi(\theta),$$ 

still in coordinates coinciding with the canonical parametrisation. Using 
arbitrary coordinates $\{\zeta^a\}$ and the notation of data set models, 
this becomes 

$$\nabla_a\nabla_b D(x\|m_\zeta) = \nabla_a\nabla_b\Phi(\zeta) \tag{4.3}$$ 

for all $x \in \mathbb{X}$. That the Hessian of the divergence function does 
not depend on its first argument is an important property since it enables 
the construction of an affine connection for the model manifold in the 
general formalism. As an additional advantage, it will be shown that the 
property (4.3) cannot hold for an arbitrary connection. In fact that 
connection is unique, a claim the proof of which will be provided by the 
explicit construction of the coefficients in the next subsection. 

Its uniqueness is not the only interesting property this connection exhibits 
however. It turns out that equation (4.3) gives rise to a rich geometry. For 
this reason, data set models satisfying this equation are said to exhibit a 
Hessian structure. Another name could be “Legendre structure”. This name 
is inspired by its use for very similar, though not necessarily identical, 
ideas in the literature iniiHniEi]. The discussion of this structure and its 
properties provides the content of the next subsection. It is only afterwards 
that the connection will be constructed starting from equation (4.3), thereby 
also showing its uniqueness. 

4.2.4 The Hessian structure 

The property defining the Hessian structure for a data set model 
$(\mathbb{X}, \mathbb{M}, \mu, D)$ is the relation (4.3), 

$$\nabla_a\nabla_b D(x\|m_\zeta) = \nabla_a\nabla_b\Phi(\zeta). \tag{4.3 revisited}$$ 

However, it is a sufficient condition that the left hand side of this expression 
is independent of the data set $x$. That there then always exists a function 
$\Phi$ satisfying equation (4.3) is presented as a Theorem below. Note that 
this property implies a well-defined metric tensor on $\mathbb{M}$; it is a 
stronger version of Condition 4. 

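A most useful observation is that the Hessian of the divergence is equal to 
the metric tensor, which can be shown through a short computation. Indeed, 
choose $x$ in the fibre of $m_\zeta$; then 

$$\begin{aligned}
\nabla_a\nabla_b D(x\|m_\zeta) &= \partial_a\partial_b D(x\|m_\zeta) - \omega^k_{\ ab}(\zeta)\,\partial_k D(x\|m_\zeta) \\
&= \partial_a\partial_b D(x\|m_\zeta) \\
&= g_{ab}(\zeta).
\end{aligned}$$ 

Even though this argument only holds for data sets which have $m_\zeta$ as 
their projection on $\mathbb{M}$, the condition (4.3) implies that 

$$\nabla_a\nabla_b D(x\|m_\zeta) = g_{ab}(\zeta) \qquad \forall x. \tag{4.4}$$ 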

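Another consequence of equation (4.3) is that it imposes conditions on the 
properties of the connection $\nabla$. The first of these is that $\nabla$ 
is torsionless or, expressed through its coefficients, 

$$\omega^c_{\ ab}(\zeta) = \omega^c_{\ ba}(\zeta), \tag{4.5}$$ 

which follows from the definition of the Hessian (2.17) and the symmetry of 
both the matrix of second derivatives and of the components of the metric 
tensor. The other two properties can be determined by computing the partial 
derivative of equation (4.4). Indeed, consider 

$$\begin{aligned}
\partial_a g_{bc} &= \partial_a\bigl(\partial_b\partial_c D(x\|m_\zeta) - \omega^d_{\ bc}\,\partial_d D(x\|m_\zeta)\bigr) \\
&= \partial_a\partial_b\partial_c D(x\|m_\zeta) - (\partial_a\omega^d_{\ bc})\,\partial_d D(x\|m_\zeta) - \omega^d_{\ bc}\,\partial_a\partial_d D(x\|m_\zeta) \\
&= \partial_a\partial_b\partial_c D(x\|m_\zeta) - (\partial_a\omega^d_{\ bc})\,\partial_d D(x\|m_\zeta)
 - \omega^d_{\ bc}\bigl[g_{ad} + \omega^e_{\ ad}\,\partial_e D(x\|m_\zeta)\bigr].
\end{aligned}$$ 

Subtracting from this equation the same equation with the indices $a$ and 
$b$ interchanged and rearranging the terms yields 

$$\begin{aligned}
0 ={}& \bigl[\partial_a\omega^d_{\ bc} - \partial_b\omega^d_{\ ac} + \omega^d_{\ ae}\,\omega^e_{\ bc} - \omega^d_{\ be}\,\omega^e_{\ ac}\bigr]\,\partial_d D(x\|m_\zeta) \\
&+ \bigl[\partial_a g_{bc} - \partial_b g_{ac} + g_{ad}\,\omega^d_{\ bc} - g_{bd}\,\omega^d_{\ ac}\bigr].
\end{aligned}$$ 

Since this must hold for all data sets $x$, the two expressions between 
square brackets must vanish independently and as such the two other 
conditions are obtained. The best known of these, 

$$\partial_a\omega^d_{\ bc} - \partial_b\omega^d_{\ ac} + \omega^d_{\ ae}\,\omega^e_{\ bc} - \omega^d_{\ be}\,\omega^e_{\ ac} = 0, \tag{4.6}$$ 

has already been identified in the introduction as expressing the connection to 
be flat and its consequences have been discussed there. The other condition, 

$$\partial_a g_{bc} - \partial_b g_{ac} + g_{ad}\,\omega^d_{\ bc} - g_{bd}\,\omega^d_{\ ac} = 0, \tag{4.7}$$ 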

is of the same form as the Codazzi-Peterson equation, which originated as 
a condition on the second fundamental form of a two-dimensional surface 
embedded in $\mathbb{R}^3$ [28, 82, 83]. Since the generalised Fisher 
information metric is obtained as the second fundamental form of a surface 
(be it a different surface for every value of $\theta$) this property is not 
wholly unexpected. There is, however, also another interpretation. Using 
equation (2.16) to obtain the coefficients $\tau^c_{\ ab}$ of the connection 
$\nabla^*$ dual to $\nabla$, (4.7) can be rewritten into 

$$g_{cd}\,(\tau^d_{\ ab} - \tau^d_{\ ba}) = 0,$$ 

which means the dual connection is also torsionless. The three conditions 
(4.5), (4.6) and (4.7) play an important role in the rest of this 
dissertation: they will directly or indirectly give rise to the rich 
properties of the Hessian structure and they will be used as a way to verify 
whether or not a parametrised family of distributions belongs to the 
exponential family. Their first use, however, is in showing the existence of 
a generalised Massieu function for data set models, a proof promised to the 
reader in the beginning of this subsection. 

Theorem 2. Given a data set model $(\mathbb{X}, \mathbb{M}, \mu, D)$ for 
which there exists a connection $\nabla$ such that 

$$\nabla_a\nabla_b D(x\|m_\zeta)$$ 

is independent of $x$, there exists a function 
$\Phi : \mathbb{R}^n \to \mathbb{R}$ such that its Hessian (with respect to 
$\nabla$) equals the Hessian of the divergence function. 

Proof. In the preceding discussion, the $x$-independence of 
$\nabla_a\nabla_b D(x\|m_\zeta)$ was already shown to lead to the equality 
of this Hessian to the metric tensor. As such, it is only required to show 
that there exists a solution $\Phi$ to the system of differential equations 

$$\partial_a\partial_b\Phi(\zeta) - \omega^c_{\ ab}(\zeta)\,\partial_c\Phi(\zeta) = g_{ab}(\zeta).$$ 

As an intermediate step, consider the differential equations 

$$\partial_a\alpha_b(\zeta) - \omega^c_{\ ab}(\zeta)\,\alpha_c(\zeta) = g_{ab}(\zeta). \tag{4.8}$$ 

This is a system of the Mayer-Lie type and it is integrable if and only if 
equations (4.6) and (4.7) are satisfied.^1 Since these conditions were shown 
to be a consequence of this theorem’s premise, a solution 
$\alpha : \mathbb{R}^n \to \mathbb{R}^n$ to the system (4.8) is thus found 
to exist. 

The only step required to complete the proof is to show that 
$\alpha_b = \partial_b\Phi$ for some function $\Phi$. Since the metric 
tensor is symmetric and the connection $\nabla$ is torsionless by 
assumption, it follows from (4.8) that 

$$\partial_a\alpha_b(\zeta) - \partial_b\alpha_a(\zeta) = 0.$$ 

Invoking Poincaré’s lemma shows the existence of a potential for 
$\alpha \stackrel{\mathrm{def}}{=} \alpha_c\,d\zeta^c$, which is exactly 
what is needed to complete the proof. □ 

As a consequence of this theorem, it immediately follows that 

$$g_{ab}(\zeta) = \nabla_a\nabla_b D(x\|m_\zeta) = \nabla_a\nabla_b\Phi.$$ 

A metric that can be written as the Hessian of a generalised Massieu 
function is called a Hessian metric (see also [83]). Since the metric is a 
positive definite tensor, this also shows that $\Phi$ is a strictly convex 
function when expressed in the affine coordinates of $\nabla$. (In the 
remainder of this subsection, the notation $\theta^i$ will consistently be 
used for these affine coordinates.) 


^1 The detailed proof of this statement is rather long and technical. The interested reader 
is therefore referred to Frankel’s excellent book EH], which contains the argument in full. 

Without any additional constraints, the Massieu function thus has a 
well-defined Legendre-Fenchel transform $S$, defined by 

$$S(U) = \inf_\theta\,\{\Phi(\theta) + \theta^k U_k\},$$ 

where the convention common in the physics literature has been used.^2 
The function $S$ can be called the generalised thermodynamic entropy in 
analogy with statistical mechanics. 

From equation (4.4) and Theorem 2 it follows that the divergence of a 
data set model exhibiting a Hessian structure takes a very simple form when 
expressed in the affine coordinates of the connection. Indeed, in such 
coordinates it is possible to write 

$$\partial_i\partial_j D(x\|m_\theta) = g_{ij}(\theta) = \partial_i\partial_j\Phi(\theta).$$ 

This leads to the simple expression 

$$D(x\|m_\theta) = \Phi(\theta) + \theta^k q_k(x) - a(x), \tag{4.9}$$ 

where the values of the functions $q_k$ and $a$ depend on the particular 
choice for $\Phi$, which is determined only up to an affine term in the 
$\theta$-coordinates. However, the functions $q_k$ must have constant values 
on the different fibres. This is the case since data sets $x$ contained in 
the fibre of $m_\theta$ must satisfy 

$$\partial_k D(x\|m_\theta) = 0 \quad\text{and so}\quad \partial_k\Phi(\theta) = -q_k(x).$$ 

Introduce the function $u : \mathbb{R}^n \to \mathbb{R}^n$ through 

$$u_k(\theta) = q_k(x) \quad\text{where}\quad \mu(x) = m_\theta.$$ 

Since the functions $q_k$ are constant on the fibres, the function $u$ is 
well-defined. It is also a homeomorphism as it is defined through the 
derivative of a strictly convex function. This function $u$ is the key to 
interpreting the precise meaning of the generalised Fisher information 
metric. After all, 

$$g_{ij}(\theta) = \partial_i\partial_j\Phi(\theta) = -\partial_i u_j(\theta). \tag{4.10}$$ 

Because of the definition of the function $u$, the Fisher information thus 
quantifies how sensitive the values of the functions $q_k$ are to a change 
in data set, or more precisely how sensitive these values are to a change in 
the fibre containing the data set. Perhaps an easier way to think about this 
is that the inverse of this tensor expresses how sensitive the parameters of 
a data set’s projection are to a small change in the observed values of the 
$q_k$. This generalises the meaning of the Fisher information metric in 
statistics, where it has this same interpretation for exponential families, 
with the expectation values of the Hamiltonians taking the role of the 
$q_k$. It is important to remember that this is a purely geometric 
statement, concerned only with the sensitivity of the parameters to a change 
in the values of the functions $q_k$. It does not support statements about 
the accuracy, that is, how close to or how far from the true value of the 
parameters a given choice of model point is. 

^2 Another possible convention would use $\sup_\theta\{\theta^k U_k - \Phi(\theta)\}$ to obtain a second convex 
function, whereas $S$ as introduced in the main body of the text is concave. 

An obvious question to ask in any theory pertaining to statistics is whether 
or not a generalisation of the Cramér-Rao bound [3H] can be formulated. At 
this point an elegant answer to that question can be provided. It is indeed 
possible to produce an inequality reminiscent of this important result, 
although its interpretation as a bound on the variances of estimators does 
not generally hold. Instead, it is possible to place a bound on the 
derivatives of the function $u$, which reduces to the Cramér-Rao bound in 
the case of exponential families of probability distributions. 

Since the metric tensor is positive definite, it must satisfy the inequality of 
Cauchy-Schwarz, 

$$\bigl(v^i w^j g_{ij}\bigr)^2 \le \bigl(v^i v^j g_{ij}\bigr)\bigl(w^k w^l g_{kl}\bigr),$$ 

for all vectors $v$ and $w$. Using the equalities (4.10) to replace the 
components of the metric, it is found after some rearranging that 

$$-v^i v^j\,\partial_i u_j \ge \frac{\bigl(v^k w^l\,\partial_k\partial_l\Phi\bigr)^2}{w^k w^l g_{kl}}. \tag{4.11}$$ 

The mathematical form of this inequality is very similar to a general 
expression of the Cramér-Rao bound for multivariate exponential families and 
even for deformed exponential families [1]. Indeed, in the former case it 
holds that 

$$\begin{aligned}
-\partial_i u_j(\theta) &= -\partial_i E_\theta[H_j] \\
&= E_\theta[H_i H_j] - E_\theta[H_i]\,E_\theta[H_j],
\end{aligned}$$ 

where $E_\theta[H_i]$ is the expectation value of the Hamiltonian $H_i$ with 
respect to $p_\theta$. Substituting this expression back into (4.11) yields 
the promised result. 
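The identity between $-\partial_i u_j$ and the covariance of the 
Hamiltonians can be illustrated numerically. The sketch below assumes a 
hypothetical one-parameter family on the sample space $\{0, 1, 2\}$ with 
$H(x) = x$; these choices and all function names are illustrative only. In 
one dimension the covariance reduces to the variance of $H$. 

```python
import math

# Assumed toy model: p_theta(x) proportional to exp(-theta * H(x)), H(x) = x.
H = [0.0, 1.0, 2.0]

def model(theta):
    """Normalised probabilities p_theta(x)."""
    weights = [math.exp(-theta * h) for h in H]
    total = sum(weights)
    return [w / total for w in weights]

def u(theta):
    """u(theta) = E_theta[H], the value of q on the fibre of m_theta."""
    return sum(p * h for p, h in zip(model(theta), H))

def variance_h(theta):
    """E_theta[H^2] - E_theta[H]^2."""
    p = model(theta)
    mean = sum(pi * h for pi, h in zip(p, H))
    return sum(pi * (h - mean) ** 2 for pi, h in zip(p, H))

def derivative(f, t, h=1e-5):
    """Central finite-difference estimate of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

theta0 = -0.4
minus_du = -derivative(u, theta0)   # left hand side, -d u / d theta
variance = variance_h(theta0)       # right hand side, Var_theta[H]
print(minus_du, variance)           # equal up to finite-difference error
```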


4.2.5 Computing the connection coefficients 

The construction of the connection coefficients for a data set model exhibiting 
a Hessian structure will proceed from equation (4.4). This formulation is 
easier to use in practice than (4.3) since one has access to the metric tensor 
in principle as soon as the data set model has been specified, whereas the 
generalised Massieu function $\Phi$ is harder to find. 

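The sixth condition on the divergence implies the existence of data sets 
$x^{(c)}$ such that the numbers 

$$A^c_{\ d}(\zeta) = \partial_d D(x^{(c)}\|m_\zeta)$$ 

form the components of an invertible matrix $A$. By employing these data 
sets it is possible to isolate the connection coefficients from equation 
(4.4). Indeed, choose $x$ in the fibre of $m_\zeta$; then the Hessian 
structure implies 

$$\begin{aligned}
0 &= \nabla_a\nabla_b D(x^{(c)}\|m_\zeta) - \nabla_a\nabla_b D(x\|m_\zeta) \\
&= \partial_a\partial_b D(x^{(c)}\|m_\zeta) - \omega^d_{\ ab}(\zeta)\,\partial_d D(x^{(c)}\|m_\zeta) - g_{ab}(\zeta) \\
&= \partial_a\partial_b D(x^{(c)}\|m_\zeta) - \omega^d_{\ ab}(\zeta)\,A^c_{\ d}(\zeta) - g_{ab}(\zeta).
\end{aligned}$$ 

From this it follows easily that 

$$\omega^c_{\ ab}(\zeta) = \sum_{d=1}^{n} \bigl(A^{-1}(\zeta)\bigr)^c_{\ d}\,\bigl[\partial_a\partial_b D(x^{(d)}\|m_\zeta) - g_{ab}(\zeta)\bigr].$$ 

In many practical applications a more careful choice of $x^{(c)}$ can be 
made such that $A$ is a diagonal matrix. The expression for the connection 
coefficients can then be written in the simpler form (without using Einstein 
summation) 

$$\omega^c_{\ ab}(\zeta) = \frac{\partial_a\partial_b D(x^{(c)}\|m_\zeta) - g_{ab}(\zeta)}{\partial_c D(x^{(c)}\|m_\zeta)}. \tag{4.12}$$ 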


Making a link with the definition of connections in the existing literature 
requires, however, use of the curves $X^c$ appearing as a consequence of the 
stronger form of Condition 6. Note that these curves will in general have an 
implicit parameter dependency, as their definition requires the choice of a 
model point $\mu(X^c(0))$, even though the notation does not indicate this. 
Starting again from equation (4.4) and differentiating with respect to the 
parameter of the curve shows 

$$\begin{aligned}
0 &= \left.\frac{d}{d\varepsilon}\nabla_a\nabla_b D(X^c(\varepsilon)\|m_\zeta)\right|_{\varepsilon=0} \\
&= \left.\frac{d}{d\varepsilon}\partial_a\partial_b D(X^c(\varepsilon)\|m_\zeta)\right|_{\varepsilon=0}
 - \omega^d_{\ ab}(\zeta)\left.\frac{d}{d\varepsilon}\partial_d D(X^c(\varepsilon)\|m_\zeta)\right|_{\varepsilon=0} \\
&= \left.\frac{d}{d\varepsilon}\partial_a\partial_b D(X^c(\varepsilon)\|m_\zeta)\right|_{\varepsilon=0} - \omega^c_{\ ab}(\zeta).
\end{aligned} \tag{4.13}$$ 

This definition is more technical than it needs to be for practical purposes. 
In particular, it is again sufficient that the expression 

$$\left.\frac{d}{d\varepsilon}\partial_b D(X^c(\varepsilon)\|m_\zeta)\right|_{\varepsilon=0}$$ 

is an invertible matrix, instead of specifically the identity, so the connection 
coefficients can be determined uniquely. 

It is instructive to remark that the connection introduced by applying the 
definitions (4.12) or (4.13) to a statistical model endowed with the 
relative entropy as its divergence can be identified as the exponential 
connection of Efron and Reeds. To be precise, it is the construction of the 
connection that generalises the construction of the exponential connection 
in information geometry. Using the same definition for different divergence 
functions then yields different connections. For example, using the 
Kullback-Leibler divergence with its arguments interchanged would yield the 
mixture connection [211130]. 

The absence of a derivative in expression (4.12) compared to expression (4.13) 
may give the impression that the discrete method is easier and less likely to 
cause mathematical problems. This need not be the case, however. For a 
general data set model, there is no guarantee that the connection coefficients 
defined through the expression (4.12) are independent of the choice of the 
data sets $x^{(c)}$. This will only be the case when the data set model 
exhibits a Hessian structure. 
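The discrete formula (4.12) and its independence of the auxiliary data set 
can be tried out on a small example. The sketch below is an illustration 
under assumptions, not the construction used in the later chapters: it takes 
a hypothetical exponential family on $\{0, 1, 2\}$ with $H(x) = x$, the 
Kullback-Leibler divergence, and a deliberately non-canonical coordinate 
$\zeta$ with $\theta = e^\zeta$. All names are illustrative and derivatives 
are finite-difference estimates. For this choice a direct calculation gives 
$\omega(\zeta) = \theta''(\zeta)/\theta'(\zeta) = 1$. 

```python
import math

H = [0.0, 1.0, 2.0]  # assumed Hamiltonian H(x) = x on the sample space {0, 1, 2}

def model(zeta):
    """Model point m_zeta, with the assumed coordinate change theta = exp(zeta)."""
    theta = math.exp(zeta)
    weights = [math.exp(-theta * h) for h in H]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) on a finite sample space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def d1(f, t, h=1e-4):
    """Central finite-difference estimate of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def d2(f, t, h=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

def connection(x_aux, zeta):
    """One-dimensional version of formula (4.12).

    x_aux is an auxiliary data set outside the fibre of m_zeta. The metric is
    computed from the model point itself, which lies in its own fibre.
    """
    fibre_point = model(zeta)

    def div_aux(z):
        return kl(x_aux, model(z))

    def div_fibre(z):
        return kl(fibre_point, model(z))

    metric = d2(div_fibre, zeta)
    return (d2(div_aux, zeta) - metric) / d1(div_aux, zeta)

zeta0 = 0.2
omega_a = connection([0.2, 0.5, 0.3], zeta0)
omega_b = connection([0.7, 0.1, 0.2], zeta0)
print(omega_a, omega_b)  # both close to 1, whatever auxiliary data set is used
```

Because the family is exponential, the two estimates coincide; for a family 
without a Hessian structure the result would in general depend on the 
auxiliary data set. 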


4.3 A generalised Pythagorean theorem 

For some data set models, the analysis of their geometric properties can be 
simplified considerably if a specific property is satisfied. In particular, this 
property allows one to reproduce much of the geometry of proper divergence 
functions as they are discussed in Chapter 3. 

A sufficient demand for a model to satisfy Condition 4 as well as for a 
connection to exist is that for all $(x, m) \in \mathrm{Dom}(D)$ it holds 
that the difference 

$$D(x\|m) - D(x\|\mu(x)) \tag{4.14}$$ 

does not depend on the chosen data set $x$, only on its projection 
$\mu(x)$. If for a data set model $(\mathbb{X}, \mathbb{M}, \mu, D)$ the 
function 

$$m \mapsto D(x\|m) \tag{4.15}$$ 

has its unique global minimum in $\mu(x)$ for all 
$x \in \mathrm{Dom}(\mu)$, this property determines a proper divergence 
function on the manifold $\mathbb{M}$. Abusing notation, it is then possible 
to define 

$$D(\mu(x)\|m) \stackrel{\mathrm{def}}{=} D(x\|m) - D(x\|\mu(x)). \tag{4.16}$$ 

It is quickly verified that this construction does indeed give rise to a proper 
divergence function. Since $\mu(x)$ is assumed to be the unique point where 
the function (4.15) reaches its global minimum, it holds that 
$D(x\|m) \ge D(x\|\mu(x))$ and as a consequence $D(\mu(x)\|m) \ge 0$ with 
equality only if $\mu(x) = m$. When $\mu(x) = m$, the difference between 
$D(x\|m)$ and $D(x\|\mu(x))$ vanishes trivially and so the newly defined 
divergence equals zero everywhere on the diagonal of 
$\mathbb{M} \times \mathbb{M}$. 
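Condition (4.14) can be checked numerically for the Kullback-Leibler 
divergence and an exponential family. The following sketch rests on 
illustrative assumptions: a hypothetical family on $\{0, 1, 2\}$ with 
$H(x) = x$, and a model map implemented by matching the expectation of $H$ 
through bisection. Two data sets with the same expectation of $H$ lie in the 
same fibre, so the excess divergence should agree for both and equal the 
proper divergence between the two model points. 

```python
import math

H = [0.0, 1.0, 2.0]  # assumed Hamiltonian H(x) = x on the sample space {0, 1, 2}

def model(theta):
    """p_theta(x) proportional to exp(-theta * H(x))."""
    weights = [math.exp(-theta * h) for h in H]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) on a finite sample space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_h(p):
    return sum(pi * h for pi, h in zip(p, H))

def project(p, lo=-50.0, hi=50.0):
    """Model map: the theta minimising D(p||p_theta), i.e. E_theta[H] = E_p[H].

    E_theta[H] decreases with theta, so plain bisection suffices.
    """
    target = mean_h(p)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_h(model(mid)) > target:
            lo = mid  # model mean too large: theta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = [0.2, 0.5, 0.3]  # two data sets with the same E[H] = 1.1,
y = [0.3, 0.3, 0.4]  # hence in the same fibre
theta_m = 0.8        # an arbitrary model point m

theta_x, theta_y = project(x), project(y)
excess_x = kl(x, model(theta_m)) - kl(x, model(theta_x))
excess_y = kl(y, model(theta_m)) - kl(y, model(theta_y))
proper = kl(model(theta_x), model(theta_m))  # divergence between model points
print(excess_x, excess_y, proper)            # the three numbers coincide
```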

If the left hand side of equation (4.16) is well defined on the domain of $D$, 
this equality will be referred to as the generalised Pythagorean theorem. An 
illustration is given in the drawing, where the divergence between points is 
represented by the squared length of the dashed line between them. 

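[Figure: illustration of the generalised Pythagorean theorem for a data set 
$x$, its projection $\mu(x)$ and a model point $m$.] 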



In some examples, such as in information geometry, it is the case that 
$\mathbb{M} \subset \mathbb{X}$, making it possible that the divergence 
function is actually defined on $\mathbb{X} \times \mathbb{X}$ as a whole. 
Such divergence functions may satisfy a stronger version of the generalised 
Pythagorean theorem. An example of this is the Kullback-Leibler divergence, 
for which it holds that 

$$D(p\|r) = D(p\|q) + D(q\|r)$$ 

when $q$ is the projection of $p$ on a particular choice of submanifold 
containing the distribution $r$ [30]. However, some of these divergences 
will fail to satisfy the generalised Pythagorean theorem on 
$\mathbb{X} \times \mathbb{M}$ yet condition (4.14) may still hold for all 
data sets and model points. The latter condition is thus weaker than the 
former. In that case it is important to formulate the Pythagorean theorem 
through expression (4.16) rather than using the original expression for the 
divergence in all three terms. 

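For data set models satisfying the property (4.16), it automatically also 
holds that the metric tensor can have its components written as 

$$\begin{aligned}
g_{ij}(\theta) &= \partial_i\partial_j D(x\|m_\theta) \\
&= \partial_i\partial_j\bigl[D(x\|\mu(x)) + D(\mu(x)\|m_\theta)\bigr] \\
&= \partial_i\partial_j D(\mu(x)\|m_\theta).
\end{aligned}$$ 

Indeed, when $x$ is an element of the fibre of $m_\theta$, both arguments of 
the expression $\partial_i\partial_j D(\mu(x)\|m_\theta)$ coincide. This in 
turn is the definition (3.9) for the metric tensor derived from proper 
divergence functions presented earlier. Also the definition of the 
connection coefficients (4.13) can be expressed in terms of the proper 
divergence. Following an analogous line of reasoning as for the metric 
tensor, one obtains 

$$\begin{aligned}
\omega^k_{\ ij}(\theta) &= \left.\frac{d}{d\varepsilon}\partial_i\partial_j D(X^k(\varepsilon)\|m_\theta)\right|_{\varepsilon=0} \\
&= \left.\frac{d}{d\varepsilon}\partial_i\partial_j D(\mu(X^k(\varepsilon))\|m_\theta)\right|_{\varepsilon=0}.
\end{aligned}$$ 

To show the correspondence with the definition (3.10) for the connection 
coefficients derived from proper divergence functions, fix $\theta$ and 
define $\theta^{(k)}(\varepsilon)$ by 

$$\mu(X^k(\varepsilon)) = m_{\theta^{(k)}(\varepsilon)}, \qquad \mu(X^k(0)) = m_\theta.$$ 

Then it is possible to write 

$$\begin{aligned}
\delta^l_k &= \left.\frac{d}{d\varepsilon}\partial_k D(X^l(\varepsilon)\|m_\theta)\right|_{\varepsilon=0} \\
&= \left.\frac{\partial}{\partial\xi^s}\partial_k D(m_\xi\|m_\theta)\right|_{\xi=\theta}
 \left.\frac{d\theta^{(l)s}(\varepsilon)}{d\varepsilon}\right|_{\varepsilon=0} \\
&= -g_{sk}(\theta)\left.\frac{d\theta^{(l)s}(\varepsilon)}{d\varepsilon}\right|_{\varepsilon=0}.
\end{aligned}$$ 

This knowledge can be used to compute the connection coefficients in an 
analogous way as above: 

$$\begin{aligned}
\omega^k_{\ ij}(\theta) &= \left.\frac{d}{d\varepsilon}\partial_i\partial_j D(X^k(\varepsilon)\|m_\theta)\right|_{\varepsilon=0} \\
&= \left.\frac{\partial}{\partial\xi^s}\partial_i\partial_j D(m_\xi\|m_\theta)\right|_{\xi=\theta}
 \left.\frac{d\theta^{(k)s}(\varepsilon)}{d\varepsilon}\right|_{\varepsilon=0} \\
&= -g^{ks}(\theta)\left.\frac{\partial}{\partial\xi^s}\partial_i\partial_j D(m_\xi\|m_\theta)\right|_{\xi=\theta}.
\end{aligned}$$ 

This is indeed the definition found in the literature and reported upon in 
the previous chapter, in particular in equation (3.10). 

4.4 Identifying exponential families 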

This section contains two easily obtained results concerning exponential 
families of probability distributions. In particular, it is shown that a 
statistical model belongs to the exponential family if and only if it 
exhibits a Hessian structure. This is useful since this allows for the 
establishment of both positive and negative results through a 
straightforward step-by-step process. For exponential families, it is shown 
that the canonical parameters can be found by solving a system of linear 
differential equations. I do not claim originality of these results, 
however. Especially the second one I expect to be contained in at least some 
books on differential geometry as a method to find the affine coordinates of 
a connection. The first result, on the other hand, genuinely appears to be 
missing from the important reference works, both on exponential families and 
on Hessian structures, as well as from the available historically important 
papers cited in the introduction. Since both properties will be used 
extensively in the treatment of the examples in the next chapter, it is 
thus instructive to (re-)derive them here. 

It was shown earlier in this chapter that exponential families exhibit a Hes¬ 
sian structure. The converse is also true: the presence of a Hessian structure 
in a statistical model only occurs for an exponential family—assuming that 
the divergence is chosen to be that of Kullback and Leibler. To show this, 
assume a statistical model with a Hessian structure derived from the relative 
entropy. Since the connection is necessarily flat and torsionless, it allows for 
affine coordinates 6 . In these coordinates the Hessian of the divergence reads 

\[ \partial_i\partial_j D(p\|p_\theta) = -\partial_i\partial_j \int_X p(x)\ln p_\theta(x)\,\mathrm{d}x \]

and by assumption both sides of this equality are independent of the
distribution $p$. But this is only possible when

\[ x \mapsto \partial_i\partial_j \ln p_\theta(x) \]

is a function of $\theta$ only. With a bit of foresight, this function can be
identified as $-\partial_i\partial_j\Phi(\theta)$. Then it follows that

\[ \ln p_\theta(x) = -\Phi(\theta) - \theta^k H_k(x) \]

for properly chosen functions $H_k$. (A term independent of $\theta$ could be
absorbed in the definition of the measure $\mathrm{d}x$ and can thus be
discarded.) This shows the result that was promised to the reader. Due to its
practical importance, it is summarised in its own theorem:

Theorem 3. A statistical model belongs to the exponential family if and only
if it exhibits a Hessian structure when it is endowed with the Kullback-Leibler
divergence.


A recurring property in the work of Efron, Reeds and Amari is that
exponential families of probability distributions are flat when endowed with the
exponential connection [2^471149]. Since the connection of the data set model
geometry generalises this exponential connection, Theorem 3 can be seen as
the data set model generalisation of that information geometric property and
its converse.

A useful corollary pertains to the canonical parameters of the exponential
families. When employed as coordinates for the model manifold, they serve
as the affine coordinates for the flat and torsionless (exponential) connection.
This means that these parameters $\theta^i$ can be found as a function of arbitrary
coordinates $\zeta^a$ by studying the connection. More specifically, the connection
coefficients are in general defined through (2.8) and this relation can be
transformed from general coordinates $\zeta^a$ to the affine coordinates
$\theta^i = \theta^i(\zeta)$. This looks like

\begin{align*}
\omega^c_{ab}\,\partial_c &= \nabla_{\partial_a}\partial_b \\
&= (\partial_a\partial_b\theta^i)\,\partial_i + (\partial_b\theta^i)(\partial_a\theta^j)\,\nabla_{\partial_j}\partial_i \\
&= (\partial_a\partial_b\theta^i)\,\partial_i + (\partial_b\theta^i)(\partial_a\theta^j)\,\omega^k_{ji}\,\partial_k,
\end{align*}

where letters from the beginning of the alphabet refer to the arbitrary
$\zeta$-coordinates and letters from the middle of the alphabet belong to the affine
coordinates $\theta$. The coefficients $\omega^k_{ji}$ thus vanish identically by assumption.
Computing the $\mathrm{d}\theta^i$-component of the left hand side and using
$\mathrm{d}\theta^i(\partial_c) = \partial_c\theta^i$ reveals

\[ \partial_a\partial_b\theta^i(\zeta) = \omega^c_{ab}(\zeta)\,\partial_c\theta^i(\zeta). \tag{4.17} \]

This is a system of linear differential equations which may be solved to find
the function $\zeta \mapsto \theta(\zeta)$ expressing the affine coordinates as a function of
arbitrary coordinates.
4.5 Discussion

It is important as well as instructive to compare the data set model
formalism with the theory of information geometry upon which it is based.

First amongst these differences is of course the fact that probability theory
has been deliberately excluded from the building blocks of the formalism.
As is explained in the introduction, this was done in order to identify the
crucial link between information theory and the geometry of the formalism,
rather than relying on the much-walked path of basing information theory
on probability theory. The possibility of using properties ultimately based
on probability, on purpose or by accident, is always present when studying
information geometry. After all, despite its geometric nature, it still
relies in part on properties of the objects it deals with to show theorems and
additional properties, as a read-through of the literature referred to in the
previous chapter will indicate. Naturally, this cannot be taken as a criticism
of information geometry in itself, which has many useful applications in a
statistical context which rely on exactly those properties.

A striking contrast with existing methods of information geometry, and
perhaps the most promising aspect of the data set model formalism in terms
of applications, is the explicit option to model data through qualitatively
different mathematical objects. In information geometry, as well as in the
study of proper divergence functions, the divergence functions take both
arguments from the same set or from a set and a subset thereof, the latter of
which then serves as the model for the former. By taking into account the
possibility that both arguments of the divergence are qualitatively different,
a much wider array of possible models can be described and can have their
divergence-implied geometry constructed and studied. This general formalism
strongly resembles that of pattern recognition in machine learning, see for
instance [H] for a good comparison between pattern recognition and
information geometry. This generality does, however, come at the price of losing
some structure which does exist in other formalisms and contexts. The family
of affine connections introduced by Chentsov and Amari is an example of
such a structure that could not be replicated.

People familiar with information geometry may perceive the emphasis of the
data set model formalism on intrinsic geometry rather than the extrinsic
geometry as an important contrast. Nevertheless, this offers a great advantage
in terms of generality as not every model manifold has an obvious embedding
in some larger set. This embedding is required to enable the study of an
extrinsic geometry. Removing the need for an embedding also opens up the
possibility of considering the entire simplex of everywhere strictly positive
probability distributions over a measurable set as the model manifold.
Doing this is rather unnatural in the usual formalism of information geometry
since there this simplex serves as the space in which the model is embedded
and from which it derives its geometry.

There are also some strong similarities between the data set model formalism 
and information geometry. The first one is obviously that it is possible to 
reconstruct the metric tensor and the exponential connection from the latter 
theory in the former, at least under certain conditions. The reconstruction 
of a family of connections has not been achieved, however. The metric tensor 
is used in quantifying information in a way which is very analogous to its 
use in information geometry: both are a measure for the sensitivity of the 
modelling process to changes in the data—or more precisely: changes in the 
fibre containing the data. 

It also proved possible to reconstruct a number of useful properties. The
most obvious of these is probably the vanishing curvature of statistical
manifolds corresponding to exponential families. This result is well-known from
information geometry, as it was one of the starting points for the study of
connections in this field. The fact that exponential families can be identified
by means of their flatness and the correspondence of canonical parameters
and affine coordinates was already implied in the work of Amari, who
demonstrated this fact through direct computation [291 ED]. However, the
above work shows that a similar result can be obtained in a more general
context by identifying all data set models which give rise to a Hessian metric
as exponential families.

The generalised Pythagorean theorem is also interesting when comparing the
data set model formalism with the study of proper divergence functions. It makes
clear that the data set model geometry is a true extension of the differential
geometry induced by proper divergence functions on a manifold.



5. EXAMPLES AND APPLICATIONS 


This last chapter is devoted to the illustration of the data set formalism. A
few familiar examples from statistics and statistical physics will be presented
to show how the formalism developed in the previous chapter behaves
when applied to these models. There are a total of six examples, each of
which demonstrates one or more important properties of the data set model
formalism.


5.1 The normal distributions 

5.1.1 Using the relative entropy 

In this first example, the Gaussian probability densities (or normal
distributions) will be covered as a model for probability densities over $\mathbb{R}$. This family
is also used as an example in Chapter 3. The results found there are mostly
rederived, now along the lines of the data set model formalism as set out in
the previous chapter. It is also illustrated how the canonical coordinates of
the exponential family can be found in a straightforward manner by constructing
the affine connection.

The data sets represent an empirical distribution over the real numbers with
finite first and second moments. In information geometry this set would
be treated as an infinite-dimensional manifold of which the normal
distributions are a submanifold. The data set model formalism, however, does
not employ this additional structure. The task at hand in this example is
then to find the best fitting member of the two-parameter family of normal
distributions, given by the expression

\[ p_{\mu,\sigma}(x) = \exp\left\{-\ln\sqrt{2\pi\sigma^2} - \frac{(x-\mu)^2}{2\sigma^2}\right\}. \]

These distributions form a two-dimensional manifold for which $\mu$ and $\sigma > 0$
are used as parameters. (The model map will not be named explicitly in
this example and so the symbol $\mu$ is free to be used for the mean of the
normal distribution, as is conventional.) Some sources use $\sigma^2$ as the second
parameter, leading to some minor differences between the results obtained
there and the ones derived here.

The best fit will be sought by minimising the Kullback-Leibler divergence

\begin{align*}
D(p\|p_{\mu,\sigma}) &= \int_{-\infty}^{+\infty} p(x)\ln\frac{p(x)}{p_{\mu,\sigma}(x)}\,\mathrm{d}x \tag{5.1} \\
&= -S(p) + \ln\sqrt{2\pi\sigma^2} + \frac{1}{2\sigma^2}\,\mathbb{E}_p[(x-\mu)^2],
\end{align*}

where $S$ represents the Shannon entropy (3.8) of the probability distribution
$p$. The model map here is entirely implied by the divergence: there is only a
single normal distribution minimising this divergence for any of the empirical
distributions serving as the first argument and this global minimum is also
a local minimum. The derivatives themselves satisfy

\begin{align*}
\partial_\mu D(p\|p_{\mu,\sigma}) &= -\frac{1}{\sigma^2}\,\mathbb{E}_p[x-\mu], \\
\partial_\sigma D(p\|p_{\mu,\sigma}) &= \frac{1}{\sigma} - \frac{1}{\sigma^3}\,\mathbb{E}_p[(x-\mu)^2].
\end{align*}

The condition that these expressions vanish simultaneously yields the
well-known expressions

\begin{align*}
\mathbb{E}_p[x] &= \mu, \tag{5.2} \\
\mathbb{E}_p[(x-\mu)^2] &= \sigma^2. \tag{5.3}
\end{align*}

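In order to determine the Fisher information metric, it is required to know
the matrix of second derivatives of the relative entropy. These satisfy the
expressions

\begin{align*}
\partial_\mu^2 D(p\|p_{\mu,\sigma}) &= \frac{1}{\sigma^2}, \\
\partial_\mu\partial_\sigma D(p\|p_{\mu,\sigma}) &= \frac{2}{\sigma^3}\,\mathbb{E}_p[x-\mu], \\
\partial_\sigma^2 D(p\|p_{\mu,\sigma}) &= -\frac{1}{\sigma^2} + \frac{3}{\sigma^4}\,\mathbb{E}_p[(x-\mu)^2].
\end{align*}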

Both expectation values appearing in this matrix can be rewritten as
functions of the parameters only. This is achieved by using the conditions (5.2)
and (5.3). As such it is implied that the metric is well-defined and has
components

\[ g_{\mu\mu}(\mu,\sigma) = \frac{1}{\sigma^2}, \qquad g_{\mu\sigma}(\mu,\sigma) = 0, \qquad g_{\sigma\sigma}(\mu,\sigma) = \frac{2}{\sigma^2}. \]
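The computation above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical sample as the empirical distribution; the function names and the finite-difference step are illustrative choices and not part of the formalism. Since the entropy $S(p)$ does not depend on the parameters, only $D + S(p)$ is evaluated, which requires nothing but the first two moments of the data.

```python
import math

def kl_plus_entropy(mu, sigma, m1, m2):
    """Return D(p || p_{mu,sigma}) + S(p) for any p with E_p[x] = m1, E_p[x^2] = m2."""
    second_about_mu = m2 - 2.0 * mu * m1 + mu * mu  # E_p[(x - mu)^2]
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + second_about_mu / (2.0 * sigma ** 2)

def hessian_2d(f, x, y, h=1e-4):
    """Central finite-difference second derivatives (f_xx, f_xy, f_yy) of f(x, y)."""
    f_xx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h ** 2
    f_yy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h ** 2
    f_xy = (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h ** 2)
    return f_xx, f_xy, f_yy

# Hypothetical data: five observations standing in for an empirical distribution.
samples = [0.3, 1.1, 2.4, 2.9, 4.0]
m1 = sum(samples) / len(samples)
m2 = sum(x * x for x in samples) / len(samples)

# Conditions (5.2) and (5.3): the best fit matches the mean and the variance.
mu_hat = m1
sigma_hat = math.sqrt(m2 - m1 * m1)

def divergence(mu, sigma):
    return kl_plus_entropy(mu, sigma, m1, m2)

# Second derivatives evaluated at the best fit, i.e. with the data in the fibre.
g_mumu, g_musigma, g_sigmasigma = hessian_2d(divergence, mu_hat, sigma_hat)
print(g_mumu, g_musigma, g_sigmasigma)
print(1.0 / sigma_hat ** 2, 0.0, 2.0 / sigma_hat ** 2)
```

Up to discretisation error, the two printed rows should agree, reflecting the components $1/\sigma^2$, $0$ and $2/\sigma^2$ of the metric.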

The next step in constructing the data set model geometry is to determine
the connection coefficients. This requires the identification of distributions
for which all but one of the derivatives of the divergence vanish. Let us
first turn our attention to finding $p^{(\mu)}$, which is used to find the connection
coefficients $\omega^\mu_{ij}$. Such a distribution must satisfy

\[ \partial_\mu D(p^{(\mu)}\|p_{\mu,\sigma}) \neq 0 \quad\text{and}\quad \partial_\sigma D(p^{(\mu)}\|p_{\mu,\sigma}) = 0. \]

It is thus necessary to find a distribution with a mean different from $\mu$ but
with its expectation value of $(x-\mu)^2$ equal to $\sigma^2$. Take the mean of $p^{(\mu)}$ to
be equal to $\mu + \delta\mu$. This can be used to compute the connection coefficients
$\omega^\mu_{ij}$ through equation (4.12). The result is

\begin{align*}
\omega^\mu_{\mu\mu}(\mu,\sigma) &= 0, \\
\omega^\mu_{\mu\sigma}(\mu,\sigma) &= \frac{2\sigma^{-3}(\delta\mu) - 0}{-\sigma^{-2}(\delta\mu)} = -\frac{2}{\sigma}, \\
\omega^\mu_{\sigma\sigma}(\mu,\sigma) &= 0.
\end{align*}


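The other three independent connection coefficients $\omega^\sigma_{ij}$ can be computed
in an analogous way. This requires the choice of a data set $p^{(\sigma)}$, which must
have $\mu$ for its mean but $(\sigma+\delta\sigma)^2$ as its second central moment. Substituting
this distribution again into equation (4.12) reveals the expressions for the
last three independent connection coefficients to be

\begin{align*}
\omega^\sigma_{\mu\mu}(\mu,\sigma) &= 0, \\
\omega^\sigma_{\mu\sigma}(\mu,\sigma) &= 0, \\
\omega^\sigma_{\sigma\sigma}(\mu,\sigma) &= \frac{-\sigma^{-2} + 3\sigma^{-4}(\sigma+\delta\sigma)^2 - 2\sigma^{-2}}{\sigma^{-1} - \sigma^{-3}(\sigma+\delta\sigma)^2} = -\frac{3}{\sigma}.
\end{align*}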

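It can be verified quickly that these coefficients are indeed the correct ones
in order to write the metric tensor as the Hessian of the divergence function,
regardless of the chosen empirical distribution $p$. Take, as an example, the
component $g_{\sigma\sigma}$. A straightforward computation yields

\begin{align*}
\nabla_\sigma\nabla_\sigma D(p\|p_{\mu,\sigma})
&= \partial_\sigma^2 D(p\|p_{\mu,\sigma}) - \omega^\sigma_{\sigma\sigma}(\mu,\sigma)\,\partial_\sigma D(p\|p_{\mu,\sigma}) \\
&= -\sigma^{-2} + 3\sigma^{-4}\,\mathbb{E}_p[(x-\mu)^2] - (-3\sigma^{-1})\left(\sigma^{-1} - \sigma^{-3}\,\mathbb{E}_p[(x-\mu)^2]\right) \\
&= \frac{2}{\sigma^2},
\end{align*}

as desired. This data set model can thus be concluded to feature a Hessian
structure.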
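The same verification can be carried out for all three components at once. The sketch below is an illustration under the assumption that the only non-vanishing connection coefficients are $\omega^\mu_{\mu\sigma} = -2/\sigma$ and $\omega^\sigma_{\sigma\sigma} = -3/\sigma$; the function name and the test values for the moments are hypothetical. It evaluates $\nabla_i\nabla_j D = \partial_i\partial_j D - \omega^k_{ij}\,\partial_k D$ for a distribution that is not in the fibre.

```python
def covariant_hessian(sigma, e1, e2):
    """Components (mu mu, mu sigma, sigma sigma) of nabla_i nabla_j D.

    e1 = E_p[x - mu] and e2 = E_p[(x - mu)^2] describe an arbitrary
    distribution p, which need not lie in the fibre of p_{mu,sigma}.
    """
    # First derivatives of the relative entropy.
    d_mu = -e1 / sigma ** 2
    d_sigma = 1.0 / sigma - e2 / sigma ** 3
    # Second (partial) derivatives of the relative entropy.
    d_mumu = 1.0 / sigma ** 2
    d_musigma = 2.0 * e1 / sigma ** 3
    d_sigmasigma = -1.0 / sigma ** 2 + 3.0 * e2 / sigma ** 4
    # Non-vanishing connection coefficients, as assumed in the lead-in.
    w_mu_musigma = -2.0 / sigma
    w_sigma_sigmasigma = -3.0 / sigma
    return (
        d_mumu,
        d_musigma - w_mu_musigma * d_mu,
        d_sigmasigma - w_sigma_sigmasigma * d_sigma,
    )

# Hypothetical off-fibre moments: the result should not depend on them.
for sigma, e1, e2 in [(1.7, 0.4, 5.0), (0.6, -2.0, 9.5)]:
    print(covariant_hessian(sigma, e1, e2), (1.0 / sigma ** 2, 0.0, 2.0 / sigma ** 2))
```

For every choice of the moments the covariant Hessian should reduce to the metric components, which is the defining feature of the Hessian structure.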
It is shown in the text of the previous chapter that this Hessian structure
implies the connection to be flat and torsionless, as well as to satisfy the
Codazzi-Peterson-like equation (4.7). This can be verified explicitly, but it is more
instructive to find the affine coordinates of the connection. The function
$(\mu,\sigma) \mapsto \theta^i(\mu,\sigma)$ expressing these affine coordinates must satisfy three
partial differential equations, given by equation (4.17). In this example, these
equations look like

\begin{align*}
\partial_\mu^2\theta^i(\mu,\sigma) &= 0, \\
\partial_\mu\partial_\sigma\theta^i(\mu,\sigma) &= -2\sigma^{-1}\,\partial_\mu\theta^i(\mu,\sigma), \\
\partial_\sigma^2\theta^i(\mu,\sigma) &= -3\sigma^{-1}\,\partial_\sigma\theta^i(\mu,\sigma).
\end{align*}

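From the first of these equations, it follows that

\[ \theta^i(\mu,\sigma) = A^i(\sigma)\,\mu + B^i(\sigma), \]

where the functions $A^i$ and $B^i$ still need determining. Substitution into the
third partial differential equation yields

\[ \mu\,\partial_\sigma^2 A^i + \partial_\sigma^2 B^i = -3\mu\sigma^{-1}\,\partial_\sigma A^i - 3\sigma^{-1}\,\partial_\sigma B^i. \]

Since this must hold for all values of $\mu$, one obtains two independent equations
for $A^i$ and $B^i$. The solutions of these demand that both $A^i$ and $B^i$ are
proportional to $\sigma^{-2}$, up to an additive constant. The second of the above partial
differential equations is then also satisfied. It is now possible to choose the
constants of proportionality in such a way as to obtain the well-known canonical
parameters, which are proportional to $\mu/\sigma^2$ and $1/\sigma^2$.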
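As a quick sanity check of this solution, the three partial differential equations can be tested by finite differences. This is only a sketch: the candidate functions $\mu/\sigma^2$ and $1/\sigma^2$ are taken up to sign and scaling, and the helper name `pde_residuals` and the evaluation point are hypothetical choices made for the illustration.

```python
def theta_1(mu, sigma):
    return mu / sigma ** 2

def theta_2(mu, sigma):
    return 1.0 / sigma ** 2

def pde_residuals(theta, mu, sigma, h=1e-4):
    """Residuals of the three equations for a candidate affine coordinate theta.

    The equations tested are
      d_mu^2 theta = 0,
      d_mu d_sigma theta = -(2 / sigma) d_mu theta,
      d_sigma^2 theta = -(3 / sigma) d_sigma theta.
    """
    d_mu = (theta(mu + h, sigma) - theta(mu - h, sigma)) / (2.0 * h)
    d_sigma = (theta(mu, sigma + h) - theta(mu, sigma - h)) / (2.0 * h)
    d_mumu = (theta(mu + h, sigma) - 2.0 * theta(mu, sigma) + theta(mu - h, sigma)) / h ** 2
    d_sigmasigma = (theta(mu, sigma + h) - 2.0 * theta(mu, sigma) + theta(mu, sigma - h)) / h ** 2
    d_musigma = (theta(mu + h, sigma + h) - theta(mu + h, sigma - h)
                 - theta(mu - h, sigma + h) + theta(mu - h, sigma - h)) / (4.0 * h ** 2)
    return (
        d_mumu,
        d_musigma + 2.0 / sigma * d_mu,
        d_sigmasigma + 3.0 / sigma * d_sigma,
    )

# All residuals should vanish up to discretisation error.
print(pde_residuals(theta_1, 0.7, 1.3))
print(pde_residuals(theta_2, 0.7, 1.3))
```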
5.1.2 Using a different divergence function


The simplicity of this example offers a good opportunity to study the effects
of using a different divergence function, in this case a divergence function
different from the relative entropy (5.1). The set X of data, the model
manifold M and the model map $\mu$ will thus remain unchanged. An obvious choice
for the new divergence $D'$ is the expression

\[ D'(p\|p_{\mu,\sigma}) = \frac{1}{2\mu_0^2}\left(\mu - \mathbb{E}_p[x]\right)^2 + \frac{1}{4\sigma_0^4}\left(\mu^2 + \sigma^2 - \mathbb{E}_p[x^2]\right)^2, \tag{5.4} \]

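where $\mu_0$ and $\sigma_0$ are strictly positive constants which may be arbitrary
otherwise. These constants are introduced for dimensional reasons but they can
also be used to change the relative sensitivity of the model map with respect
to the values of $\mathbb{E}_p[x]$ and $\mathbb{E}_p[x^2]$. The derivatives of this divergence satisfy

\begin{align*}
\partial_\mu D'(p\|p_{\mu,\sigma}) &= \frac{1}{\mu_0^2}\left(\mu - \mathbb{E}_p[x]\right) + \frac{\mu}{\sigma_0^4}\left(\mu^2 + \sigma^2 - \mathbb{E}_p[x^2]\right), \\
\partial_\sigma D'(p\|p_{\mu,\sigma}) &= \frac{\sigma}{\sigma_0^4}\left(\mu^2 + \sigma^2 - \mathbb{E}_p[x^2]\right).
\end{align*}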

For these two expressions to vanish simultaneously, it is required that

\begin{align*}
\mathbb{E}_p[x] &= \mu, \\
\mathbb{E}_p[x^2] &= \sigma^2 + \mu^2.
\end{align*}

By a straightforward computation it can be seen that these two conditions
are equivalent to (5.2) and (5.3). This means the model maps implied by the
divergence functions (5.1) and (5.4) coincide as intended.

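To compute the metric tensor, the matrix of second derivatives of the
divergence is again needed. Another straightforward computation yields

\begin{align*}
\partial_\mu^2 D'(p\|p_{\mu,\sigma}) &= \frac{1}{\mu_0^2} + \frac{3\mu^2 + \sigma^2 - \mathbb{E}_p[x^2]}{\sigma_0^4}, \\
\partial_\mu\partial_\sigma D'(p\|p_{\mu,\sigma}) &= \frac{2\mu\sigma}{\sigma_0^4}, \\
\partial_\sigma^2 D'(p\|p_{\mu,\sigma}) &= \frac{\mu^2 + 3\sigma^2 - \mathbb{E}_p[x^2]}{\sigma_0^4}.
\end{align*}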


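In order to determine the metric tensor, the distribution $p$ needs to be chosen
in the fibre of $p_{\mu,\sigma}$. This means $\mathbb{E}_p[x^2] = \mu^2 + \sigma^2$ and so the metric tensor
has components

\[ g_{\mu\mu}(\mu,\sigma) = \frac{2\mu^2}{\sigma_0^4} + \frac{1}{\mu_0^2}, \qquad g_{\mu\sigma}(\mu,\sigma) = \frac{2\mu\sigma}{\sigma_0^4}, \qquad g_{\sigma\sigma}(\mu,\sigma) = \frac{2\sigma^2}{\sigma_0^4}. \]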
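These components can be cross-checked numerically. The following sketch assumes the sum of squares divergence in the form $D' = (\mu - \mathbb{E}_p[x])^2/(2\mu_0^2) + (\mu^2 + \sigma^2 - \mathbb{E}_p[x^2])^2/(4\sigma_0^4)$; the function names, the constants and the evaluation point are hypothetical. The data moments are fixed on the fibre of the chosen model point and the Hessian is taken by finite differences.

```python
def d_prime(mu, sigma, m1, m2, mu0, sigma0):
    """Sum of squares divergence for data with E_p[x] = m1 and E_p[x^2] = m2."""
    return ((mu - m1) ** 2 / (2.0 * mu0 ** 2)
            + (mu ** 2 + sigma ** 2 - m2) ** 2 / (4.0 * sigma0 ** 4))

def hessian_2d(f, x, y, h=1e-4):
    """Central finite-difference second derivatives (f_xx, f_xy, f_yy) of f(x, y)."""
    f_xx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h ** 2
    f_yy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h ** 2
    f_xy = (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h ** 2)
    return f_xx, f_xy, f_yy

# Hypothetical model point and constants.
mu, sigma = 0.8, 1.5
mu0, sigma0 = 2.0, 1.5

# Data in the fibre: moments agree with those of the model point.
m1 = mu
m2 = mu ** 2 + sigma ** 2

numeric = hessian_2d(lambda a, b: d_prime(a, b, m1, m2, mu0, sigma0), mu, sigma)
expected = (
    2.0 * mu ** 2 / sigma0 ** 4 + 1.0 / mu0 ** 2,
    2.0 * mu * sigma / sigma0 ** 4,
    2.0 * sigma ** 2 / sigma0 ** 4,
)
print(numeric)
print(expected)
```

Unlike in the case of the relative entropy, the off-diagonal component does not vanish in the coordinates $\mu$ and $\sigma$.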























Note that the sum of squares divergence (5.4) yields a more complicated
metric tensor than the apparently more complex Kullback-Leibler divergence.
It is also easy to verify that the connection constructed from the divergence
$D'$ is not the metric connection, since the latter does not vanish, a fact that
can be checked using equation (2.15). These complications are all in some way
a consequence of the presence of the parameter $\mu$ in the second term, a
presence which is in itself necessitated by the expression for the second
(non-central) moment of a normal distribution, which refers to both the mean $\mu$
and the variance $\sigma^2$.

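Since the matrix of second derivatives does not coincide with the metric
tensor, it is necessary to compute the expressions (4.12) to find the connection
coefficients $\omega^c_{ab}$ of $\nabla'$. Since only two of the second derivatives of the
divergence $D'$ depend on the arbitrary distribution and then only on the
second moment of this distribution, only $\omega^\sigma_{\mu\mu}$ and $\omega^\sigma_{\sigma\sigma}$ may be different
from zero. The probability distribution $p^{(\sigma)}$ required to compute these
coefficients must have a second moment different from $\sigma^2 + \mu^2$, say
$(\sigma+\delta\sigma)^2 + \mu^2$. Using again the definition (4.12) yields for the first coefficient

\begin{align*}
\omega^\sigma_{\mu\mu}(\mu,\sigma)
&= \frac{\partial_\mu^2 D'(p^{(\sigma)}\|p_{\mu,\sigma}) - g_{\mu\mu}(\mu,\sigma)}{\partial_\sigma D'(p^{(\sigma)}\|p_{\mu,\sigma})} \\
&= \frac{3\mu^2 + \sigma^2 - [(\sigma+\delta\sigma)^2 + \mu^2] - 2\mu^2}{\sigma\left(\mu^2 + \sigma^2 - [(\sigma+\delta\sigma)^2 + \mu^2]\right)} \\
&= \frac{1}{\sigma}.
\end{align*}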

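The second one is computed in a very analogous way to obtain

\begin{align*}
\omega^\sigma_{\sigma\sigma}(\mu,\sigma)
&= \frac{\partial_\sigma^2 D'(p^{(\sigma)}\|p_{\mu,\sigma}) - g_{\sigma\sigma}(\mu,\sigma)}{\partial_\sigma D'(p^{(\sigma)}\|p_{\mu,\sigma})} \\
&= \frac{\mu^2 + 3\sigma^2 - [(\sigma+\delta\sigma)^2 + \mu^2] - 2\sigma^2}{\sigma\left(\mu^2 + \sigma^2 - [(\sigma+\delta\sigma)^2 + \mu^2]\right)} \\
&= \frac{1}{\sigma}.
\end{align*}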

As such, the connection is well-defined but it is a different connection than
the one obtained when the relative entropy is used. Computing the curvature
tensor explicitly allows one to verify that this connection is flat. However,
finding the affine coordinates of this connection will have this as a corollary.
To find these special coordinates, the differential equations (4.17) can be used
and for this particular example they take the form

\begin{align*}
\partial_\mu^2\theta^i(\mu,\sigma) &= \sigma^{-1}\,\partial_\sigma\theta^i(\mu,\sigma), \\
\partial_\mu\partial_\sigma\theta^i(\mu,\sigma) &= 0, \\
\partial_\sigma^2\theta^i(\mu,\sigma) &= \sigma^{-1}\,\partial_\sigma\theta^i(\mu,\sigma).
\end{align*}

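The second of these equations demands that

\[ \theta^i(\mu,\sigma) = A^i(\mu) + B^i(\sigma). \]

Inserting this in the third equation yields a differential equation for $B^i$,

\[ \sigma\,\partial_\sigma^2 B^i(\sigma) = \partial_\sigma B^i(\sigma), \]

and thus $B^i(\sigma) = B^i_0\sigma^2$ up to an additive constant. The first equation is
then satisfied only if

\[ \partial_\mu^2 A^i(\mu) = 2B^i_0. \]

A particular solution $A^i = A^i_0\mu$ is obtained when $B^i_0 = 0$, whereas for
$B^i_0 \neq 0$ a solution is $A^i = B^i_0\mu^2$. As such two canonical parameters are revealed to be

\[ \eta^1 = \mu \quad\text{and}\quad \eta^2 = \mu^2 + \sigma^2. \]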

These are the so-called “expectation parameters” of the family of normal 
distributions m- 

Note that this treatment never used the explicit expression for the model
distributions. A similar “sum of squares” divergence function may thus be
defined in a very broad range of cases. The approach does have the
disadvantage of lacking interpretation, however. To take a concrete example, the
family of Gumbel distributions (discussed in detail in the last example)
could be adopted as a model when endowed with this same divergence (5.4)
as they too can be distinguished uniquely by their first and second moments.
In fact this last example shows, mutatis mutandis, that this form of the
divergence will always yield a flat and torsionless connection, at least when
the squares in (5.4) contain sufficiently well-behaved expressions.
5.2 Linear regression

Perhaps the simplest and best-known way of fitting a functional relation to
data points is linear regression. The input data in this example takes the
form of a set of couples $(x_j, y_j)$ which are believed to exhibit a functional
relationship in principle but which have been contaminated by some form of
noise. It should be noted that nothing in this example is new; this is merely
an illustration that a wide array of problems fits into the data set model
formalism.

The set X contains all collections $S$ consisting of $N_S$ couples $(x_j, y_j) \in \mathbb{R}^2$.
Note that these data sets $S$ are not required to contain the same
number of couples; however, they must satisfy

\[ \sum_j \sum_{j'} (x_j - x_{j'})^2 \neq 0. \]

The model points are the first order polynomials of the form

\[ f_{a,b}(x) = ax + b, \qquad a, b \in \mathbb{R}. \]

The numbers $a$ and $b$, indicating the slope and intercept of the polynomial
function, are also used as the coordinates for the model manifold M. The
obvious choice of divergence, which must indicate the best fitting model, is
to adopt the quantity which must be minimised in the least squares method,

\[ D(S\|f_{a,b}) = \frac{1}{2}\sum_j (y_j - ax_j - b)^2. \tag{5.5} \]

Minimising this function can be done by setting the derivatives
of this divergence function equal to zero, which yields

\begin{align*}
\partial_a D(S\|f_{a,b}) &= -\sum_j x_j\,(y_j - ax_j - b), \\
\partial_b D(S\|f_{a,b}) &= -\sum_j (y_j - ax_j - b).
\end{align*}

Through a straightforward computation, it is found that these derivatives
vanish simultaneously when

\[ a = \frac{\sum_j y_j x_j - \frac{1}{N_S}\sum_j y_j \sum_k x_k}{\sum_j x_j^2 - \frac{1}{N_S}\big(\sum_j x_j\big)^2}, \qquad b = \frac{1}{N_S}\sum_j (y_j - ax_j). \]

These are of course nothing but the usual expressions for the slope and
intercept of the best fit to the data $S$. Thus, the data set model formalism
contains this aspect of the least squares linear regression method.
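For concreteness, the closed-form expressions for the slope and intercept can be written out as a short routine. This is a minimal sketch; the function name `least_squares_line` and the sample points are hypothetical, and the guard against a vanishing denominator mirrors the condition imposed on the data sets.

```python
def least_squares_line(points):
    """Return (a, b) minimising 0.5 * sum_j (y_j - a * x_j - b)^2."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    denominator = sum_xx - sum_x * sum_x / n
    if denominator == 0.0:
        # All x-values coincide: such a data set is outside the domain of the model map.
        raise ValueError("the x-values of the data set must not all coincide")
    a = (sum_xy - sum_x * sum_y / n) / denominator
    b = (sum_y - a * sum_x) / n
    return a, b

# Hypothetical noise-free data on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
slope, intercept = least_squares_line(points)
print(slope, intercept)
```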

The next interesting quantity is the metric tensor. To compute this quantity,
knowledge of the matrix of second derivatives of the divergence is needed.
The independent components are given by

\begin{align*}
\partial_a^2 D(S\|f_{a,b}) &= \sum_j x_j^2, \\
\partial_a\partial_b D(S\|f_{a,b}) &= \sum_j x_j, \\
\partial_b^2 D(S\|f_{a,b}) &= N_S.
\end{align*}

This matrix depends only on the data and not on the parameters $a$ or $b$. This
is not in itself a problem, as it might in principle be possible to express these
quantities only as a function of those parameters by restricting these expressions
to data sets for which the best fit is a given line. However, it is clear that
this will not be possible: the best fit is determined also by the $y$-values of the
couples in the data set and there is no mention of those numbers in the matrix
above. As a consequence, the subsets of X on which the above expressions
are constant are not related to the fibres of the model. This shows that the
divergence (5.5) does not satisfy the necessary conditions for a divergence in
the data set model formalism.

Another divergence which yields the same model map was already mentioned
in an earlier publication [71]. It is given by

\begin{align*}
D_\lambda(S\|f_{a,b}) = {} & \frac{\lambda^2}{2\sum_{j,k}(x_j - x_k)^2}\sum_{j,k}\left[y_j - y_k - a(x_j - x_k)\right]^2 \\
& + \frac{1}{2\sum_{j,k}(x_j - x_k)^2}\sum_{j,k}\left[x_j y_k - x_k y_j - b(x_j - x_k)\right]^2,
\end{align*}

where $\lambda > 0$ is an arbitrary constant introduced for dimensional reasons.
The matrix of second derivatives of this divergence is particularly simple,
with constant components wholly independent of the data. This means this
matrix necessarily coincides with the metric tensor, which is characterised
by

\[ g_{aa}(a,b) = \lambda^2, \qquad g_{ab}(a,b) = 0, \qquad g_{bb}(a,b) = 1. \]

Since both objects coincide, all connection coefficients will vanish and the
coordinates already in use are affine coordinates for this connection. The
model manifold is therefore concluded to be endowed with a flat and
torsionless connection.
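That a pairwise divergence of this kind yields the same model map as ordinary least squares can be illustrated numerically. The sketch below assumes that the divergence consists of two normalised sums over pairs, one of $[y_j - y_k - a(x_j - x_k)]^2$ and one of $[x_j y_k - x_k y_j - b(x_j - x_k)]^2$, so that its minimisers are ratios of pairwise sums; the function names and the data are hypothetical.

```python
def pairwise_line(points):
    """Minimisers (a, b) of the two pairwise sums of squares."""
    pairs = [(pj, pk) for pj in points for pk in points]
    denominator = sum((xj - xk) ** 2 for (xj, _), (xk, _) in pairs)
    a = sum((yj - yk) * (xj - xk) for (xj, yj), (xk, yk) in pairs) / denominator
    b = sum((xj * yk - xk * yj) * (xj - xk) for (xj, yj), (xk, yk) in pairs) / denominator
    return a, b

def least_squares_line(points):
    """Ordinary least squares slope and intercept, for comparison."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    a = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    return a, (sum_y - a * sum_x) / n

# Hypothetical noisy data.
points = [(0.0, 1.2), (1.0, 2.9), (2.0, 5.1), (3.5, 7.8), (4.0, 9.3)]
print(pairwise_line(points))
print(least_squares_line(points))
```

Both routines should return the same slope and intercept up to rounding, while only the pairwise form has second derivatives with respect to $a$ and $b$ that are independent of the data.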
5.3 The grand canonical ensemble for identical particles

In this first non-trivial example, the goal is to model a system of
non-interacting bosonic particles which may occupy states $\{j\}$ with corresponding
energy levels $\{\varepsilon_j\}$. This is a well-known and extensively studied model.
Hence, no surprising results are to be expected. Nevertheless, such a familiar
example may be interesting for the reader as it applies the developed
formalism to a system he or she may already be familiar with.

The set X of all data sets contains possible outcomes of an experiment to
measure the number $n_j$ of bosons occupying the state $j$. These will be
denoted by $\{n_j\}_j$ or just $n$ where the context makes confusion unlikely to occur.
(The letter $n$ is not used for the dimension of M in this example. A similar
remark holds for $\mu$ introduced soon.) It will follow implicitly from the
discussion that not all possible configurations $n$ are in the domain of the model
map. Such configurations are excluded from the outset. The model points
making up the manifold M are the distributions of the grand canonical
ensemble. That is, they are the probabilities of the states $j$ having occupations
$n_j$ and they are given by

\[ p_{\beta,\mu}(n) = \frac{1}{Z(\beta,\mu)}\exp\Big\{-\beta\sum_j (\varepsilon_j - \mu)\,n_j\Big\}, \tag{5.6} \]

where $Z$ is the partition function of the system (also serving as a
normalisation factor), $\beta$ represents the inverse temperature and $\mu$ the chemical
potential. It is a well-known result of statistical physics, see for example [1],
that

\[ Z(\beta,\mu) = \Big(\prod_j \sum_{n_j=0}^{\infty}\Big)\exp\Big\{-\beta\sum_j (\varepsilon_j - \mu)\,n_j\Big\} = \prod_j \frac{1}{1 - \exp\{-\beta(\varepsilon_j - \mu)\}}, \]

where the expression between large parentheses is an abuse of notation to
indicate a multitude of sums. It is assumed that the remaining product
converges. This is not a trivial assumption but it is not native to the data set
model formalism and so no excessive attention will be given to this problem.
From expression (5.6) it is clear that the model distributions belong to the
exponential family with parameters $\theta^1 = \beta$ and $\theta^2 = -\beta\mu$. However, it is
more instructive to use $\beta$ and $\mu$ as parameters or coordinates of the model
manifold as this will illustrate that the geometric properties of the model are
indeed independent of the choice of parameters.

The geometry of a data set model is completely determined by the chosen
divergence. The choice made here is given by the expression

\[ D(n\|p_{\beta,\mu}) = \ln Z(\beta,\mu) + \beta\sum_j \varepsilon_j n_j - \beta\mu\sum_j n_j. \tag{5.7} \]

For a realistic experimental outcome there is a highest occupied energy level
and a finite number of particles, thereby avoiding additional mathematical
difficulties with this definition.

It first needs to be verified whether or not equation (5.7) does indeed define
a divergence. The expression is continuous and continuously differentiable as
a function of the parameters when $\beta > 0$ and $\mu \notin \{\varepsilon_j\}$. Since from a physical
point of view this model only makes sense when $\beta > 0$ and $\mu < \min_j \varepsilon_j$, no
difficulties are expected to be encountered in a practical setting.

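The derivative of the divergence (5.7) with respect to the inverse temperature
$\beta$ is given by the expression

\begin{align*}
\partial_\beta D(n\|p_{\beta,\mu}) &= \partial_\beta \ln Z(\beta,\mu) + \sum_j \varepsilon_j n_j - \mu\sum_j n_j \\
&= \sum_j (\varepsilon_j - \mu)\left(n_j - \frac{1}{\exp\{\beta(\varepsilon_j - \mu)\} - 1}\right). \tag{5.8}
\end{align*}

Through an analogous computation, it is found that

\begin{align*}
\partial_\mu D(n\|p_{\beta,\mu}) &= \partial_\mu \ln Z(\beta,\mu) - \beta\sum_j n_j \\
&= \beta\sum_j \left(\frac{1}{\exp\{\beta(\varepsilon_j - \mu)\} - 1} - n_j\right). \tag{5.9}
\end{align*}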

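These two derivatives must vanish simultaneously if $n$ is to be contained
in the fibre of the grand canonical distribution $p_{\beta,\mu}$. Rewriting the above
expressions shows this is equivalent to demanding that the total energy and
total number of bosonic particles can be expressed through the well-known
relations

\[ \sum_j \varepsilon_j n_j = \sum_j \frac{\varepsilon_j}{\exp\{\beta(\varepsilon_j - \mu)\} - 1} \quad\text{and}\quad \sum_j n_j = \sum_j \frac{1}{\exp\{\beta(\varepsilon_j - \mu)\} - 1}. \]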
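These fibre conditions lend themselves to a numerical illustration. The sketch below assumes the divergence in the form $\ln Z(\beta,\mu) + \beta\sum_j(\varepsilon_j - \mu)n_j$ with the bosonic partition function $\ln Z = -\sum_j \ln(1 - e^{-\beta(\varepsilon_j - \mu)})$; the energy levels, parameter values and function names are hypothetical. For simplicity the Bose-Einstein mean occupations are used as a (non-integer) stand-in for a data set in the fibre.

```python
import math

def log_partition(beta, mu, levels):
    """ln Z for non-interacting bosons; requires beta > 0 and mu < min(levels)."""
    return -sum(math.log(1.0 - math.exp(-beta * (e - mu))) for e in levels)

def bose_einstein(beta, mu, levels):
    """Mean occupation 1 / (exp{beta (e - mu)} - 1) of every level."""
    return [1.0 / (math.exp(beta * (e - mu)) - 1.0) for e in levels]

def divergence(beta, mu, levels, n):
    """Assumed form of the divergence: ln Z + beta * sum_j (e_j - mu) n_j."""
    return log_partition(beta, mu, levels) + beta * sum(
        (e - mu) * nj for e, nj in zip(levels, n))

def gradient(beta, mu, levels, n, h=1e-6):
    """Central finite-difference derivatives with respect to beta and mu."""
    d_beta = (divergence(beta + h, mu, levels, n)
              - divergence(beta - h, mu, levels, n)) / (2.0 * h)
    d_mu = (divergence(beta, mu + h, levels, n)
            - divergence(beta, mu - h, levels, n)) / (2.0 * h)
    return d_beta, d_mu

levels = [0.0, 0.5, 1.0, 2.0]  # hypothetical energy levels
beta, mu = 1.2, -0.5           # hypothetical model point

n_fibre = bose_einstein(beta, mu, levels)
grad_on_fibre = gradient(beta, mu, levels, n_fibre)
grad_off_fibre = gradient(beta, mu, levels, [2, 1, 0, 0])
print(grad_on_fibre)   # both derivatives vanish up to discretisation error
print(grad_off_fibre)  # this data set belongs to a different fibre
```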
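Whether or not a given data set $n$ is acceptable depends on whether or not
these equations have solutions for $\beta$ and $\mu$ for this data set. The second
derivatives are also important to the data set model formalism. A straightforward
computation yields

\begin{align*}
\partial_\beta^2 D(n\|p_{\beta,\mu}) &= \sum_j \frac{(\varepsilon_j - \mu)^2}{(\exp\{\beta(\varepsilon_j - \mu)\} - 1)^2}\,\exp\{\beta(\varepsilon_j - \mu)\}, \\
\partial_\beta\partial_\mu D(n\|p_{\beta,\mu}) &= \sum_j \frac{1}{\exp\{\beta(\varepsilon_j - \mu)\} - 1} - \beta\sum_j \frac{\varepsilon_j - \mu}{(\exp\{\beta(\varepsilon_j - \mu)\} - 1)^2}\,\exp\{\beta(\varepsilon_j - \mu)\} - \sum_j n_j, \\
\partial_\mu^2 D(n\|p_{\beta,\mu}) &= \beta^2\sum_j \frac{1}{(\exp\{\beta(\varepsilon_j - \mu)\} - 1)^2}\,\exp\{\beta(\varepsilon_j - \mu)\}.
\end{align*}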

The mixed derivative still depends on the data set $n$. However, this
dependency only involves the sum of all occupation numbers and can thus
be rewritten using the equalities right above when $n$ is in the fibre of $p_{\beta,\mu}$.
Furthermore, this will cancel out the first sum in the expression, thereby
simplifying the derivative. This means the metric tensor can easily be written
down as

\[ g(\beta,\mu) = \sum_j \frac{\exp\{\beta(\varepsilon_j - \mu)\}}{(\exp\{\beta(\varepsilon_j - \mu)\} - 1)^2}
\begin{pmatrix} (\varepsilon_j - \mu)^2 & -\beta(\varepsilon_j - \mu) \\ -\beta(\varepsilon_j - \mu) & \beta^2 \end{pmatrix}. \tag{5.10} \]


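This metric can be shown to be positive definite through direct computation.
For any tangent vector $v$ it holds that

\[ g(v,v) = \sum_j \frac{\exp\{\beta(\varepsilon_j - \mu)\}}{(\exp\{\beta(\varepsilon_j - \mu)\} - 1)^2}\left(v^\beta(\varepsilon_j - \mu) - \beta v^\mu\right)^2 \geq 0. \]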
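The metric and its positivity can be probed numerically. The following sketch assumes the matrix form $g = \sum_j w_j \begin{pmatrix}(\varepsilon_j-\mu)^2 & -\beta(\varepsilon_j-\mu)\\ -\beta(\varepsilon_j-\mu) & \beta^2\end{pmatrix}$ with weights $w_j = e^{\beta(\varepsilon_j-\mu)}/(e^{\beta(\varepsilon_j-\mu)}-1)^2$, and compares it with a finite-difference Hessian of the divergence $\ln Z + \beta\sum_j(\varepsilon_j-\mu)n_j$ at a data set in the fibre. Energy levels, parameter values and function names are hypothetical.

```python
import math

def log_partition(beta, mu, levels):
    return -sum(math.log(1.0 - math.exp(-beta * (e - mu))) for e in levels)

def metric(beta, mu, levels):
    """Components (g_bb, g_bm, g_mm) of the assumed metric."""
    g_bb = g_bm = g_mm = 0.0
    for e in levels:
        x = math.exp(beta * (e - mu))
        w = x / (x - 1.0) ** 2
        g_bb += w * (e - mu) ** 2
        g_bm += -w * beta * (e - mu)
        g_mm += w * beta ** 2
    return g_bb, g_bm, g_mm

def hessian_of_divergence(beta, mu, levels, n, h=1e-4):
    """Finite-difference Hessian of ln Z + beta * sum_j (e_j - mu) n_j, n held fixed."""
    def d(b, m):
        return log_partition(b, m, levels) + b * sum(
            (e - m) * nj for e, nj in zip(levels, n))
    d_bb = (d(beta + h, mu) - 2.0 * d(beta, mu) + d(beta - h, mu)) / h ** 2
    d_mm = (d(beta, mu + h) - 2.0 * d(beta, mu) + d(beta, mu - h)) / h ** 2
    d_bm = (d(beta + h, mu + h) - d(beta + h, mu - h)
            - d(beta - h, mu + h) + d(beta - h, mu - h)) / (4.0 * h ** 2)
    return d_bb, d_bm, d_mm

levels = [0.0, 0.5, 1.0, 2.0]  # hypothetical energy levels
beta, mu = 1.2, -0.5           # hypothetical model point

# Mean occupations serve as a (non-integer) stand-in for data in the fibre.
n_fibre = [1.0 / (math.exp(beta * (e - mu)) - 1.0) for e in levels]

g = metric(beta, mu, levels)
hess = hessian_of_divergence(beta, mu, levels, n_fibre)
determinant = g[0] * g[2] - g[1] ** 2
print(g, hess, determinant)
```

With at least two distinct energy levels the determinant is strictly positive, so the quadratic form vanishes only for $v = 0$.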


The next step in the search for the canonical coordinates of the model
distributions is to determine the connection. Since only the mixed derivative of
the divergence depends on the data set $n$, it can quickly be seen that

\[ \omega^\beta_{\beta\beta} = \omega^\mu_{\beta\beta} = \omega^\beta_{\mu\mu} = \omega^\mu_{\mu\mu} = 0. \]

The two remaining independent coefficients, $\omega^\beta_{\beta\mu}$ and $\omega^\mu_{\beta\mu}$, are only slightly
harder to find. X is a discrete set, but since it can be thought of as embedded
in $\mathbb{R}^N$, with $N$ (at least) the number of available energy levels if this is finite,
it is possible to use either the expression (4.13) to compute the coefficients
or the discretised version (4.12). It is the latter expression that will be
used here. This means fixing values for the parameters $\beta$ and $\mu$ and then
finding appropriate data sets with which to compute the coefficients of the
connection. The computations of connection coefficients each include two
data sets: one data set $n$ which is contained in the fibre of $p_{\beta,\mu}$ and another
one denoted as $n^{(i)}$ for which

\[ \partial_j D(n^{(i)}\|p_{\beta,\mu}) = \delta^i_j, \]

or another invertible matrix if this is more practical. Since the derivatives of
the divergence only depend on the data sets through the quantities

\[ \sum_j n_j \quad\text{and}\quad \sum_j n_j\varepsilon_j, \]

which are constant in the fibres, it is not too difficult to find these “off-fibre”
data sets. The condition for $n^{(\beta)}$ is that

\[ \partial_\beta D(n^{(\beta)}\|p_{\beta,\mu}) \neq 0 \quad\text{and}\quad \partial_\mu D(n^{(\beta)}\|p_{\beta,\mu}) = 0. \]

From the second of the conditions, it can be seen that

\[ \sum_j n^{(\beta)}_j = \sum_j n_j = \sum_j \frac{1}{\exp\{\beta(\varepsilon_j - \mu)\} - 1}. \]

This is enough to compute the connection coefficient given by

\[ \omega^\beta_{\beta\mu} = \frac{\partial_\beta\partial_\mu D(n^{(\beta)}\|p_{\beta,\mu}) - \partial_\beta\partial_\mu D(n\|p_{\beta,\mu})}{\partial_\beta D(n^{(\beta)}\|p_{\beta,\mu})} = 0. \]


This means only the coefficient $\omega^\mu_{\beta\mu}$ remains to be determined. The
conditions on the data set $n^{(\mu)}$ can be expressed as

\[ \partial_\mu D(n^{(\mu)}\|p_{\beta,\mu}) \neq 0 \quad\text{and}\quad \partial_\beta D(n^{(\mu)}\|p_{\beta,\mu}) = 0. \tag{5.11} \]

The first condition implies

\[ \beta\left(\sum_j \frac{1}{\exp\{\beta(\varepsilon_j - \mu)\} - 1} - \sum_j n^{(\mu)}_j\right) \]

must differ from zero. Since $n^{(\mu)}$ is arbitrary within the constraints (5.11),
the expression between parentheses can be chosen to be equal to 1. This
suffices to determine the final (and only) independent connection coefficient,
being

\[ \omega^\mu_{\beta\mu} = \frac{\partial_\beta\partial_\mu D(n^{(\mu)}\|p_{\beta,\mu}) - \partial_\beta\partial_\mu D(n\|p_{\beta,\mu})}{\partial_\mu D(n^{(\mu)}\|p_{\beta,\mu})} = \frac{1}{\beta}. \]

By simple substitution it can be verified that this connection will indeed
make the Hessian of the divergence independent of its first argument and
equal to the metric found before, as it is given in expression (5.10).

To find the canonical parameters making up the affine coordinates for this
model manifold, attention is turned towards equation (4.17). The partial
differential equations to be solved in this example are

\begin{align*}
\partial_\beta^2\theta^i &= 0, \\
\partial_\beta\partial_\mu\theta^i &= \frac{1}{\beta}\,\partial_\mu\theta^i, \\
\partial_\mu^2\theta^i &= 0.
\end{align*}

From the first and the last of these equations, it follows that

\[ \theta^i(\beta,\mu) = A^i\mu\beta + B^i\mu + C^i\beta + D^i, \]

where the Roman capital letters are constants. The second differential
equation demands that $A^i = \frac{1}{\beta}(A^i\beta + B^i)$ and thus $B^i = 0$. The other three
constants may be chosen freely and with a proper choice one obtains the
canonical parameters mentioned earlier,

\[ \theta^1 = \beta \quad\text{and}\quad \theta^2 = -\beta\mu. \]

Since the connection is flat and its coefficients can be expressed in a fairly
simple way in the coordinates $\beta$ and $\mu$, it is worthwhile to see what the
geodesics and covariant constant vector fields would look like in the
coordinate system determined by $\beta$ and $\mu$.

Given a vector $v$ at one particular point of the manifold M, it is possible
to construct a covariant constant vector field everywhere through parallel
transport. Along an arbitrary curve parametrised by $t$, the covariant
constant vector field satisfies the equations

\[ \frac{\mathrm{d}v^\beta}{\mathrm{d}t} = 0 \quad\text{and}\quad \frac{\mathrm{d}v^\mu}{\mathrm{d}t} = -\frac{v^\beta}{\beta}\frac{\mathrm{d}\mu}{\mathrm{d}t} - \frac{v^\mu}{\beta}\frac{\mathrm{d}\beta}{\mathrm{d}t}. \]

The first equation implies that $v^\beta$ has the same value everywhere on M.
Because this component is constant, it is possible to rewrite the second equation
as

\[ \frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{\beta v^\mu}{v^\beta}\right) = -\frac{\mathrm{d}\mu}{\mathrm{d}t}. \]

This differential equation has for its solution

\[ v^\mu(t) = \frac{\mu_0 - \mu(t)}{\beta(t)}\,v^\beta, \]



where $\mu_0$ is an integration constant which can be determined by choosing the
value of the vector field at a given point. An example of a covariant constant
vector field for this connection is depicted in the drawing. This may not look
to the reader as a vector field which is parallel with itself everywhere but this
is in fact the case. The origin of the possible confusion is that a coordinate
system (and thus a coordinate frame) is chosen for the illustration in which
the connection coefficients do not vanish.


[Figure: a covariant constant vector field for this connection, drawn in the $(\beta, \mu)$ coordinate plane.]
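The closed-form solution for the transported vector can be compared with a direct numerical integration. The sketch below assumes the transport equations $\mathrm{d}v^\beta/\mathrm{d}t = 0$ and $\mathrm{d}v^\mu/\mathrm{d}t = -(v^\beta\dot\mu + v^\mu\dot\beta)/\beta$; the curve, the initial vector and all function names are hypothetical. The invariant $\beta v^\mu + v^\beta\mu$ plays the role of $v^\beta\mu_0$ in the solution.

```python
import math

# Hypothetical curve in the (beta, mu) half-plane with beta > 0.
def beta(t):
    return 1.0 + t

def beta_dot(t):
    return 1.0

def mu(t):
    return -1.0 + 0.5 * math.sin(t)

def mu_dot(t):
    return 0.5 * math.cos(t)

V_BETA = 0.8  # the component v^beta stays constant along the curve

def rhs(t, v_mu):
    """Right-hand side of the transport equation for v^mu."""
    return -(V_BETA * mu_dot(t) + v_mu * beta_dot(t)) / beta(t)

def rk4_scalar(f, y0, t0, t1, steps):
    """Classical fourth-order Runge-Kutta integration of y' = f(t, y)."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2.0, y + h / 2.0 * k1)
        k3 = f(t + h / 2.0, y + h / 2.0 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return y

def closed_form(t, v_mu0, t0=0.0):
    """v^mu(t) from the conserved quantity beta * v^mu + v^beta * mu."""
    c = beta(t0) * v_mu0 + V_BETA * mu(t0)
    return (c - V_BETA * mu(t)) / beta(t)

v_mu_numeric = rk4_scalar(rhs, 0.3, 0.0, 2.0, 2000)
v_mu_exact = closed_form(2.0, 0.3)
print(v_mu_numeric, v_mu_exact)
```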


In order to get a better understanding of the behaviour of this connection,
the geodesics can also be considered. The general geodesic equations are
given by the two non-linear differential equations

\[ \frac{\mathrm{d}^2\beta}{\mathrm{d}t^2} = 0 \quad\text{and}\quad \frac{\mathrm{d}^2\mu}{\mathrm{d}t^2} + \frac{2}{\beta}\frac{\mathrm{d}\beta}{\mathrm{d}t}\frac{\mathrm{d}\mu}{\mathrm{d}t} = 0. \]

From the first equation it follows that $\beta(t) = At + B$. This allows the second
differential equation to be rewritten into the form

\[ \frac{\mathrm{d}^2\mu}{\mathrm{d}t^2}(t) = -\frac{2A}{At + B}\frac{\mathrm{d}\mu}{\mathrm{d}t}(t). \]

A particular solution is achieved when $A = 0$, making $\beta$ constant and
$\mu(t) = Ct + D$. Otherwise, this equation can be solved by elementary methods such
as separation of variables and order reduction to obtain the result

\[ \mu(t) = \mu(t_0) + \dot\mu(t_0)\,\frac{(At_0 + B)^2}{A}\left(\frac{1}{At_0 + B} - \frac{1}{At + B}\right). \]


This family of curves excludes the curves of constant $\beta$ (equivalently, constant
$\theta^1$), which are a particular solution to the geodesic equations corresponding
to $A = 0$. The drawing below shows a selection of canonical coordinate curves,
all of which are geodesics, as they appear in the original coordinate system
of the parameters $\beta$ and $\mu$.
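The geodesics can also be obtained by direct numerical integration, which offers a check on the closed-form solution. The sketch below assumes the geodesic equations $\ddot\beta = 0$ and $\ddot\mu = -(2/\beta)\dot\beta\dot\mu$ with initial time $t_0 = 0$; the initial data and function names are hypothetical. It also checks that $\theta^2 = -\beta\mu$ is an affine function of $t$ along the curve, as expected for a geodesic expressed in affine coordinates.

```python
def geodesic_rhs(t, state):
    """First-order form of the geodesic equations; state = [beta, beta', mu, mu']."""
    b, b_dot, m, m_dot = state
    return [b_dot, 0.0, m_dot, -2.0 * b_dot * m_dot / b]

def rk4(f, y, t0, t1, steps):
    """Classical fourth-order Runge-Kutta integration for a system y' = f(t, y)."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2.0, [yi + h / 2.0 * ki for yi, ki in zip(y, k1)])
        k3 = f(t + h / 2.0, [yi + h / 2.0 * ki for yi, ki in zip(y, k2)])
        k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

# Hypothetical initial data at t0 = 0: beta(t) = A t + B.
A, B = 0.5, 1.0
MU_0, MU_DOT_0 = -1.0, 0.3

def closed_form(t):
    """(beta(t), mu(t)) for t0 = 0, where the general solution simplifies."""
    b = A * t + B
    m = MU_0 + MU_DOT_0 * B * t / b
    return b, m

def theta_2(t):
    b, m = closed_form(t)
    return -b * m

final_state = rk4(geodesic_rhs, [B, A, MU_0, MU_DOT_0], 0.0, 2.0, 2000)
beta_exact, mu_exact = closed_form(2.0)
print(final_state[0], final_state[2], beta_exact, mu_exact)
```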
5.4 The von Mises-Fisher distributions

This example is divided into two separate modelling problems. In both cases, 
the manifold is a subset of an exponential family. This is very instructive as 
the geometry of these submanifolds as induced by the divergence is exactly 
what is expected from a submanifold of a Euclidean space—even though the 
containing space is not endowed with a Euclidean metric. 

5.4.1 Fixing the width of the distribution 

The von Mises-Fisher distributions are families of distributions on the sphere
[85]. That is, they belong to the set of distributions $p$ for which

$$\sum_i \mathbb{E}_p[x_i]^2 = 1. \qquad (5.12)$$

In particular, the von Mises-Fisher distributions take the exponential form

$$p_{\kappa,\mu}(x) = \exp\{-\Phi(\kappa) + \kappa\mu^i x_i\} \quad\text{where}\quad \sum_i (\mu^i)^2 = 1. \qquad (5.13)$$

This notation employs the traditional parametrisation, but it is easy to see
that this is an exponential family by taking as the canonical parameters
$\theta^i = -\kappa\mu^i$, analogous to the situation for the grand canonical distribution
of bosonic particles treated previously. It is not the intention to repeat the
above example with a few extra degrees of freedom. Instead the computations
here will be restricted to three-dimensional data and, more importantly,
only a subset of the von Mises-Fisher distributions will be considered as the
model manifold $M$. In particular the parameter $\kappa$, which is a measure for
the width of the distribution, will be held constant. The manifold $M$ thus
obtained is topologically equivalent to the 2-sphere, which can be seen by
employing a parametrisation by spherical polar coordinates¹ $\theta$ and $\varphi$,

$$p_{\theta,\varphi}(x) = \exp\{-\Phi(\kappa) + \kappa(\sin(\theta)\cos(\varphi)x_1 + \sin(\theta)\sin(\varphi)x_2 + \cos(\theta)x_3)\}.$$

The choice for the symbol $\theta$ is made in accordance with the usual names of
the spherical polar coordinates and it should not be mistaken for a canonical
coordinate of an exponential family.

¹ The usual remarks for the use of this parametrisation apply: when $\theta = 0$
or $\theta = \pi$, the coordinate $\varphi$ is not well-defined and so $\theta$ and $\varphi$ do not form a proper
coordinate system. As this will not hinder the treatment of this example too much, this
parametrisation will be used nonetheless.

The set $X$ of data sets is the space of all possible probability distributions
on the sphere as defined by equation (5.12). The model manifold $M$ is the
two-dimensional set of von Mises-Fisher distributions with a fixed value of
$\kappa$. As the divergence $D$ it proves convenient to opt for the Kullback-Leibler
divergence function,

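$$\begin{aligned}
D(p\|p_{\theta,\varphi}) &= -S(p) - \int_X p(x)\left[-\Phi(\kappa) + \kappa\mu^i x_i\right]dx \\
&= -S(p) + \Phi(\kappa) \\
&\quad - \kappa\big(\sin(\theta)\cos(\varphi)\,\mathbb{E}_p[x_1] + \sin(\theta)\sin(\varphi)\,\mathbb{E}_p[x_2] + \cos(\theta)\,\mathbb{E}_p[x_3]\big),
\end{aligned}$$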

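where as usual $S(p)$ is the Shannon entropy (3.8) of the probability distribution
$p$. The derivatives of this divergence are given by

$$\begin{aligned}
\partial_\theta D(p\|p_{\theta,\varphi}) &= -\kappa\big(\cos(\theta)\cos(\varphi)\,\mathbb{E}_p[x_1] + \cos(\theta)\sin(\varphi)\,\mathbb{E}_p[x_2] - \sin(\theta)\,\mathbb{E}_p[x_3]\big), \\
\partial_\varphi D(p\|p_{\theta,\varphi}) &= -\kappa\big(-\sin(\theta)\sin(\varphi)\,\mathbb{E}_p[x_1] + \sin(\theta)\cos(\varphi)\,\mathbb{E}_p[x_2]\big).
\end{aligned}$$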

Though it is formally possible to solve these equations together with the
normalisation condition (5.12) to find which distributions $p$ make up the fibres
of the von Mises-Fisher distribution $p_{\theta,\varphi}$, there is an easier way. The
divergence $D(p\|p_{\theta,\varphi})$ as defined above is minimal when the expression $\mu^i\,\mathbb{E}_p[x_i]$
attains its maximal value. Since this can be viewed as the inner product of
two vectors of unit norm, the maximum is obtained when the two vectors are
equal, that is

$$\mathbb{E}_p[x_1] = \sin(\theta)\cos(\varphi), \qquad \mathbb{E}_p[x_2] = \sin(\theta)\sin(\varphi), \qquad \mathbb{E}_p[x_3] = \cos(\theta).$$

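Note that there is also another set of distributions $p$ for which the derivatives
of $D(p\|p_{\theta,\varphi})$ vanish: those distributions for which the values $\mathbb{E}_p[x_i]$ take the
negative of the values above. This is a local maximum of the divergence
function, however, and hence it is no candidate for the required solution.
The second derivatives of the divergence are found to equal

$$\begin{aligned}
\partial_\theta^2 D(p\|p_{\theta,\varphi}) &= \kappa\big(\sin(\theta)\cos(\varphi)\,\mathbb{E}_p[x_1] + \sin(\theta)\sin(\varphi)\,\mathbb{E}_p[x_2] + \cos(\theta)\,\mathbb{E}_p[x_3]\big), \\
\partial_\theta\partial_\varphi D(p\|p_{\theta,\varphi}) &= \kappa\big(\cos(\theta)\sin(\varphi)\,\mathbb{E}_p[x_1] - \cos(\theta)\cos(\varphi)\,\mathbb{E}_p[x_2]\big), \\
\partial_\varphi^2 D(p\|p_{\theta,\varphi}) &= \kappa\big(\sin(\theta)\cos(\varphi)\,\mathbb{E}_p[x_1] + \sin(\theta)\sin(\varphi)\,\mathbb{E}_p[x_2]\big).
\end{aligned}$$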

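Combined with the expressions for $\mathbb{E}_p[x_i]$ that make up the conditions for
$p$ to be an element of the $p_{\theta,\varphi}$-fibre, knowledge of these second derivatives
allows for the metric tensor to be expressed. Its components read

$$g_{\theta\theta}(\theta,\varphi) = \kappa, \qquad g_{\theta\varphi}(\theta,\varphi) = 0, \qquad g_{\varphi\varphi}(\theta,\varphi) = \kappa\sin^2(\theta).$$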

This is the metric tensor of a 2-sphere with radius $\sqrt{\kappa}$. This is a clear
example of a model where the geometry introduced by the divergence coincides
with the geometry that is expected from the manifold M itself, even though 
the manifold itself is not isomorphic to However, this does not ensure 
the connection $\nabla$ will be the metric connection on such a sphere. In fact,
at this point there is no guarantee that a connection as introduced in the 
previous chapter would even exist in this example. A formal investigation is 
thus required. 
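
The metric components can be reproduced numerically. The sketch below assumes the model-dependent part of the divergence is $-\kappa\mu^i\,\mathbb{E}_p[x_i]$ (the terms $-S(p) + \Phi(\kappa)$ do not depend on the coordinates and are dropped) and evaluates its second derivatives by finite differences on a fibre, where $\mathbb{E}_p[x_i] = \mu^i$. The value of `KAPPA` and all function names are illustrative assumptions.

```python
import math

KAPPA = 2.5  # fixed width parameter; arbitrary illustrative value


def mean_direction(theta, phi):
    """Unit vector mu(theta, phi) in spherical polar coordinates."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def divergence_part(theta, phi, expectation):
    """Coordinate-dependent part of D(p || p_{theta,phi}): -kappa * mu^i E_p[x_i]."""
    mu = mean_direction(theta, phi)
    return -KAPPA * sum(m * e for m, e in zip(mu, expectation))


def metric_on_fibre(theta, phi, h=1e-4):
    """Second derivatives of the divergence, with E_p[x_i] frozen at the fibre value."""
    e = mean_direction(theta, phi)  # fibre condition E_p[x_i] = mu^i

    def f(a, b):
        return divergence_part(a, b, e)

    g_tt = (f(theta + h, phi) - 2 * f(theta, phi) + f(theta - h, phi)) / h ** 2
    g_pp = (f(theta, phi + h) - 2 * f(theta, phi) + f(theta, phi - h)) / h ** 2
    g_tp = (f(theta + h, phi + h) - f(theta + h, phi - h)
            - f(theta - h, phi + h) + f(theta - h, phi - h)) / (4 * h ** 2)
    return g_tt, g_tp, g_pp


if __name__ == "__main__":
    theta, phi = 1.0, 0.3
    print(metric_on_fibre(theta, phi))             # numerical components
    print(KAPPA, 0.0, KAPPA * math.sin(theta) ** 2)  # kappa, 0, kappa sin^2(theta)
```

Up to finite-difference error the two printed triples agree, matching the round-sphere metric of radius $\sqrt{\kappa}$.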

The determination of the connection coefficients will proceed by first finding
curves $\epsilon \mapsto x^i(\epsilon)$ through the space $X$. Both the first and second derivatives
of the divergence function in this example depend only on the expectation
values of the random variables $x_i$, which means it suffices to focus on these
expectation values. However, these values are not independent: they too
lie on a sphere, just as the parameters $\mu^i$ do. The use of this knowledge
reveals that one family of desirable curves can be obtained (as
functions of a parameter $\psi$) through the expressions

$$\mathbb{E}_p[x_1] = \sin(\theta - \psi)\cos(\varphi), \qquad \mathbb{E}_p[x_2] = \sin(\theta - \psi)\sin(\varphi), \qquad \mathbb{E}_p[x_3] = \cos(\theta - \psi).$$

Analogously, the second family of curves is obtained (as functions of $\xi$) through

$$\mathbb{E}_p[x_1] = \sin(\theta)\cos(\varphi - \xi), \qquad \mathbb{E}_p[x_2] = \sin(\theta)\sin(\varphi - \xi), \qquad \mathbb{E}_p[x_3] = \cos(\theta).$$


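When either of these parameters is equal to zero, the curves find themselves
in a point representing a fibre of the model $p_{\theta,\varphi}$, as intended. Using these
curves over the sphere of the expectation values, it is found that

$$\left.\frac{\partial^2 D}{\partial\psi\,\partial\theta}\right|_{\psi=\xi=0} = \kappa, \qquad
\left.\frac{\partial^2 D}{\partial\xi\,\partial\theta}\right|_{\psi=\xi=0} = 0,$$

$$\left.\frac{\partial^2 D}{\partial\psi\,\partial\varphi}\right|_{\psi=\xi=0} = 0, \qquad
\left.\frac{\partial^2 D}{\partial\xi\,\partial\varphi}\right|_{\psi=\xi=0} = \kappa\sin^2(\theta).$$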


This yields exactly the metric tensor, even though this is a coincidental
consequence of the choice of the curves rather than a general property: another
choice of curves would have been equally valid but would have yielded different
expressions. While the result is not the Kronecker delta one would desire
under ideal circumstances, the invertibility of the metric tensor means these
curves can indeed be used to define the connection coefficients. The following
step is to use the above expressions for the curves through the expectation
value sphere in the expressions for the second derivatives of the divergence.
The resulting quantities can be differentiated with respect to $\psi$ and $\xi$ to obtain the
coefficients $\omega_{kij} = g_{ks}\,\omega^s_{ij}$. The usual coefficients can then be found through
a multiplication with the matrix inverse of the metric tensor. There are six
derivatives to compute. These are, again suppressing some straightforward
function arguments in order to reduce the burden of notation,


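$$\left.\frac{\partial}{\partial\psi}\frac{\partial^2 D}{\partial\theta^2}\right|_{\psi=\xi=0} = 0, \qquad
\left.\frac{\partial}{\partial\xi}\frac{\partial^2 D}{\partial\theta^2}\right|_{\psi=\xi=0} = 0,$$

$$\left.\frac{\partial}{\partial\psi}\frac{\partial^2 D}{\partial\theta\,\partial\varphi}\right|_{\psi=\xi=0} = 0, \qquad
\left.\frac{\partial}{\partial\xi}\frac{\partial^2 D}{\partial\theta\,\partial\varphi}\right|_{\psi=\xi=0} = \kappa\cos(\theta)\sin(\theta),$$

$$\left.\frac{\partial}{\partial\psi}\frac{\partial^2 D}{\partial\varphi^2}\right|_{\psi=\xi=0} = -\kappa\sin(\theta)\cos(\theta), \qquad
\left.\frac{\partial}{\partial\xi}\frac{\partial^2 D}{\partial\varphi^2}\right|_{\psi=\xi=0} = 0.$$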


Since the matrix multiplication that needs to be performed involves a square
matrix, this operation is straightforward and the two independent non-vanishing
connection coefficients appear as

$$\omega^{\theta}_{\varphi\varphi}(\theta,\varphi) = -\sin(\theta)\cos(\theta) \quad\text{and}\quad \omega^{\varphi}_{\theta\varphi}(\theta,\varphi) = \frac{\cos(\theta)}{\sin(\theta)}.$$


These coincide with the coefficients of the metric connection on a spherical
surface (see for example [86]). It is not necessary to compute the curvature
tensor to see if it vanishes: the reader will no doubt agree that the familiar
metric connection of a spherical surface will not be found to exhibit flatness.
The example is thus a very instructive one. It shows that for a properly
chosen divergence function, the geometry introduced by the formalism of
data set models will coincide with the geometry expected from the choice of
the model manifold $M$, something which may come as a surprise to readers
familiar with information geometry. After all, a prominent role there is played
by connections which are not metric. Also, since this connection is not flat,
it is an example of a data set model where a connection exists but where the
metric is not of the Hessian type.
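
The construction of the connection coefficients can be traced step by step in a short numerical sketch. It assumes the second derivatives of the divergence and the two families of curves given above, differentiates along the curves by central differences, and raises the first index with the inverse of the diagonal metric. The dictionary keys `"t"` and `"p"` (for $\theta$ and $\varphi$) and the value of `KAPPA` are illustrative choices.

```python
import math

KAPPA = 2.5  # arbitrary illustrative value of the fixed width parameter


def second_derivatives(theta, phi, e):
    """Second derivatives of D for expectation values e = (E[x1], E[x2], E[x3])."""
    s, c = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    return {
        ("t", "t"): KAPPA * (s * cp * e[0] + s * sp * e[1] + c * e[2]),
        ("t", "p"): KAPPA * (c * sp * e[0] - c * cp * e[1]),
        ("p", "p"): KAPPA * (s * cp * e[0] + s * sp * e[1]),
    }


def curve_psi(theta, phi, psi):
    """First family of curves over the sphere of expectation values."""
    return (math.sin(theta - psi) * math.cos(phi),
            math.sin(theta - psi) * math.sin(phi),
            math.cos(theta - psi))


def curve_xi(theta, phi, xi):
    """Second family of curves over the sphere of expectation values."""
    return (math.sin(theta) * math.cos(phi - xi),
            math.sin(theta) * math.sin(phi - xi),
            math.cos(theta))


def connection_coefficients(theta, phi, h=1e-5):
    """Return omega^k_{ij} as a dict keyed by (k, i, j)."""
    lowered = {}
    for k, curve in (("t", curve_psi), ("p", curve_xi)):
        plus = second_derivatives(theta, phi, curve(theta, phi, h))
        minus = second_derivatives(theta, phi, curve(theta, phi, -h))
        for ij in plus:
            lowered[(k,) + ij] = (plus[ij] - minus[ij]) / (2 * h)
    # The metric is diagonal, so raising the first index is a simple rescaling.
    inverse_metric = {"t": 1.0 / KAPPA,
                      "p": 1.0 / (KAPPA * math.sin(theta) ** 2)}
    return {key: inverse_metric[key[0]] * value for key, value in lowered.items()}


if __name__ == "__main__":
    for key, value in sorted(connection_coefficients(1.0, 0.3).items()):
        print(key, value)
```

Only the entries `("t", "p", "p")` and `("p", "t", "p")` come out non-zero, with values close to $-\sin(\theta)\cos(\theta)$ and $\cos(\theta)/\sin(\theta)$ respectively.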

5.4.2 A cylindrical submanifold

With the knowledge obtained in the previous example regarding von Mises-Fisher
distributions, it is easy to consider a different submanifold of the
three-dimensional family (5.13). In particular, this example will treat a
submanifold homeomorphic to the half-cylinder, that is the two-parameter family
of distributions

$$p_{\varphi,\lambda}(x) = \exp\{-\Xi(\kappa,\lambda) + \kappa(\cos(\varphi)x_1 + \sin(\varphi)x_2) - \lambda x_3\},$$

where $\kappa$ is again a positive constant and $\lambda > 0$ is now an independent
parameter. This distribution is essentially a product of the one-dimensional von
Mises(-Fisher) distribution and the exponential distribution. The motivation
for computing this very similar example is to see whether a flat connection
follows here. This might be expected from the analogy with the previous example,
where the geometry of the model manifold coincided with that of the sphere to
which it is homeomorphic. Cylinder mantles are known to have flat metric
connections, a fact which follows from their ability to be unrolled onto $\mathbb{R}^2$
without distorting the intrinsic geometry of the surface.

The data sets which will be modelled by these new cylindrical von Mises-Fisher
distributions are those distributions $p$ for which

$$\mathbb{E}_p[x_1]^2 + \mathbb{E}_p[x_2]^2 = 1 \quad\text{and}\quad \mathbb{E}_p[x_3] > 0.$$

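Unlike in the previous example, it is actually necessary to know the
normalisation function $\Xi$ in order to complete the computations. The distribution
must be normalised and thus

$$\begin{aligned}
1 &= \int_X \exp\{-\Xi(\kappa,\lambda) + \kappa(\cos(\varphi)x_1 + \sin(\varphi)x_2) - \lambda x_3\}\,dx \\
&= \exp\{-\Xi(\kappa,\lambda)\}\int_{S^1} \exp\{\kappa(\cos(\varphi)x_1 + \sin(\varphi)x_2)\}\,d\ell \times \int_0^{\infty} \exp\{-\lambda x_3\}\,dx_3 \\
&= \exp\{-\Xi(\kappa,\lambda)\}\,\frac{2\pi I_0(\kappa)}{\lambda},
\end{aligned}$$

where $I_0$ is the modified Bessel function of order 0 IHS]. This means

$$\Xi(\kappa,\lambda) = \ln\frac{2\pi I_0(\kappa)}{\lambda}.$$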
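
The normalisation can be cross-checked numerically. The sketch below assumes the closed form $\Xi(\kappa,\lambda) = \ln(2\pi I_0(\kappa)/\lambda)$ and compares it with a direct quadrature over the unit circle, using the exact value $1/\lambda$ for the half-line integral. The power series for $I_0$ is the standard one; the function names are illustrative.

```python
import math


def bessel_i0(kappa, terms=40):
    """Modified Bessel function I_0 from its power series sum_k (kappa/2)^(2k) / (k!)^2."""
    return sum((kappa / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))


def xi_closed_form(kappa, lam):
    """Normalisation function ln(2 pi I_0(kappa) / lambda)."""
    return math.log(2.0 * math.pi * bessel_i0(kappa) / lam)


def xi_numerical(kappa, lam, phi, n=2000):
    """Same quantity from a rectangle-rule quadrature over the unit circle.

    On the circle x1 = cos(a), x2 = sin(a), so the exponent is kappa * cos(phi - a).
    The integral of exp(-lam * x3) over x3 >= 0 equals 1 / lam exactly.
    """
    step = 2.0 * math.pi / n
    circle = sum(math.exp(kappa * math.cos(phi - j * step)) for j in range(n)) * step
    return math.log(circle / lam)


if __name__ == "__main__":
    print(xi_closed_form(1.5, 0.7), xi_numerical(1.5, 0.7, phi=0.4))
```

The two values agree to high precision and the numerical one does not depend on $\varphi$, as expected for a normalisation that is a function of $\kappa$ and $\lambda$ only.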


The remaining part of the computation is analogous to the treatment above
and so fewer details will be supplied so as not to burden the reader. The choice
of divergence is again the Kullback-Leibler divergence, which has derivatives
in this example given by

$$\begin{aligned}
\partial_\varphi D(p\|p_{\varphi,\lambda}) &= -\kappa\big(-\sin(\varphi)\,\mathbb{E}_p[x_1] + \cos(\varphi)\,\mathbb{E}_p[x_2]\big), \\
\partial_\lambda D(p\|p_{\varphi,\lambda}) &= -\frac{1}{\lambda} + \mathbb{E}_p[x_3].
\end{aligned}$$

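The distributions $p$ in the $p_{\varphi,\lambda}$-fibres are thus those distributions satisfying

$$\mathbb{E}_p[x_1] = \cos(\varphi), \qquad \mathbb{E}_p[x_2] = \sin(\varphi), \qquad \mathbb{E}_p[x_3] = \frac{1}{\lambda}.$$

The second derivatives of the divergence are equal to

$$\begin{aligned}
\partial_\varphi^2 D(p\|p_{\varphi,\lambda}) &= \kappa\big(\cos(\varphi)\,\mathbb{E}_p[x_1] + \sin(\varphi)\,\mathbb{E}_p[x_2]\big), \\
\partial_\varphi\partial_\lambda D(p\|p_{\varphi,\lambda}) &= 0, \\
\partial_\lambda^2 D(p\|p_{\varphi,\lambda}) &= \lambda^{-2}.
\end{aligned}$$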

On the fibres these expressions are constants and so the metric tensor takes
the form

$$g(\varphi,\lambda) = \begin{pmatrix} \kappa & 0 \\ 0 & \lambda^{-2} \end{pmatrix}.$$

The $g_{\varphi\varphi}$-component, as well as the off-diagonal components of this metric,
are what is to be expected for a cylinder with radius $\sqrt{\kappa}$. The $\lambda^{-2}$ serving
as the $g_{\lambda\lambda}$-component may seem unexpected for a cylinder, but this should
come as no surprise given that the model distributions attach an exponential
probability density to the random variable $x_3$.

In order to compute the connection coefficients, proper curves must be found
along which to compute the derivatives serving as the definition of the
connection coefficients; just like in the previous example, preference is given to
this method for practical purposes. Since only the second derivative with
respect to $\varphi$ of the divergence depends on the distribution $p$, all coefficients
except $\omega_{\lambda\varphi\varphi}$ and $\omega_{\varphi\varphi\varphi}$ can already be seen to vanish. The first of these two
will also vanish. Indeed, the appropriate derivative must be taken along a
curve which is parametrised by $\mathbb{E}_p[x_3]$, as that is the only $p$-dependency of
$\partial_\lambda D(p\|p_{\varphi,\lambda})$, but $\partial_\varphi^2 D(p\|p_{\varphi,\lambda})$ does not depend on this expectation value,
making the derivative in (4.13) vanish. In order to determine the last
coefficient, take curves parametrised by $\xi$ such that along this curve

$$\mathbb{E}_p[x_1] = \cos(\varphi - \xi), \qquad \mathbb{E}_p[x_2] = \sin(\varphi - \xi), \qquad \mathbb{E}_p[x_3] = \text{const.},$$
analogous to what happened in the spherical example. Then, along this
curve, the second $\varphi$-derivative of the divergence takes the form

$$\partial_\varphi^2 D(p\|p_{\varphi,\lambda}) = \kappa\big(\cos(\varphi)\cos(\varphi - \xi) + \sin(\varphi)\sin(\varphi - \xi)\big).$$

Differentiating this expression with respect to $\xi$ and evaluating at $\xi = 0$ shows that
this coefficient vanishes as well. This means that the chosen parameters $\varphi$ and
$\lambda$ are indeed the affine coordinates of the connection and thus the canonical
parameters of the distribution.

Since all the connection coefficients vanish identically, it is not necessary to
compute the curvature tensor. However, it is possible to check whether or
not the Codazzi-Peterson-like equation (4.7) is satisfied. As the computation
is already taking place in an affine coordinate system, it is sufficient to verify
whether or not

$$\partial_\lambda g_{\varphi\varphi} = \partial_\varphi g_{\varphi\lambda} \quad\text{and}\quad \partial_\lambda g_{\varphi\lambda} = \partial_\varphi g_{\lambda\lambda}.$$

A quick look at the coefficients of the metric tensor shows that all four of
these quantities vanish and so both equalities are satisfied. This would mean
that there does indeed exist a Massieu function $\Phi$, the Hessian of which is the
metric tensor. This may come as a surprise since the cylinder has a periodic
nature in one direction and so it seems impossible that there exists a properly
behaved convex function everywhere on the cylinder mantle. The answer to
this is of course that there is no such function. After all, the coordinate $\varphi$
is only a proper coordinate if a curve parametrised by $\lambda$ is removed from
the half-cylinder mantle, and the argument leading to the existence of the
Massieu function made use of Poincaré's lemma, which holds locally rather
than globally. When the points with coordinates $(\pi, \lambda)$ are removed, a proper
coordinate is obtained and then

$$\Phi(\varphi,\lambda) = \frac{\kappa\varphi^2}{2} - \ln\lambda$$

is indeed a convex function of which the matrix of second derivatives equals
the metric tensor. This is a consequence of the local nature of differential
geometry and one should be mindful of this caveat in applications.
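
As a quick sanity check, the Hessian of the Massieu function can be evaluated by finite differences. The sketch assumes $\Phi(\varphi,\lambda) = \kappa\varphi^2/2 - \ln\lambda$ on the chart $-\pi < \varphi < \pi$; the helper names and sample point are illustrative.

```python
import math


def massieu(phi, lam, kappa):
    """Assumed Massieu function kappa * phi^2 / 2 - ln(lambda)."""
    return 0.5 * kappa * phi ** 2 - math.log(lam)


def hessian(f, x, y, h=1e-4):
    """Central-difference second derivatives (f_xx, f_xy, f_yy) of f at (x, y)."""
    f_xx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h ** 2
    f_yy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h ** 2
    f_xy = (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h ** 2)
    return f_xx, f_xy, f_yy


if __name__ == "__main__":
    kappa, phi, lam = 2.5, 0.8, 1.7
    print(hessian(lambda a, b: massieu(a, b, kappa), phi, lam))
    print(kappa, 0.0, lam ** -2)  # expected metric components
    # The first derivative kappa * phi takes opposite values at phi = -pi and
    # phi = +pi, so the function cannot be continued smoothly around the cylinder.
    print(kappa * math.pi, -kappa * math.pi)
```

The last line illustrates the local nature of the result: the function itself matches at $\varphi = \pm\pi$, but its derivative does not.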

5.5 The Gumbel distributions

The last example in this chapter considers the Gumbel distributions EH as 
models for empirical probability distributions. As such this example is very 
similar to the first one. The biggest difference here is that the chosen
statistical model is not an exponential family. This means that the construction of
the geometrical quantities of the data set model formalism will fail. In this
way, it will be confirmed that the model is indeed not an exponential family.
The Gumbel distributions form a two-parameter family of probability
densities which are often employed to model the distribution of the minimal or
maximal values of a number of statistical samples. The Gumbel distributions,
whose domain is the set of real numbers, can be written as

$$p_{\alpha,\mu}(x) = \exp\{\ln\alpha - \alpha(x - \mu) - e^{-\alpha(x-\mu)}\}.$$

The parameters are $\alpha$ and $\mu$. The first of these is a strictly positive
parameter determining the shape of the distribution. For reasons of
notational convenience, the choice for $\alpha$ differs from the one in the literature,
where $\beta = \alpha^{-1}$ is more commonly used. The effect of varying the parameter
$\alpha$, for $\mu = 0$, is sketched in the illustration directly below. It can be seen
that larger values of $\alpha$ indicate a sharper distribution.



The parameter $\mu$ represents the mode of the distribution, and not the mean
as is common for Gaussian distributions. The mean of the Gumbel distribution
equals $\mu + \gamma/\alpha$, where $\gamma$ is the Euler-Mascheroni constant. The effect
of changing $\mu$ would be to perform a horizontal shift of the distribution over
a distance $\mu$ and so no distributions with non-zero $\mu$-values are included in
the illustration.

It is tempting to choose as the set $X$ again all possible distributions over the
real numbers. Unfortunately this is untenable, as certain
expectation values appearing in the computation below would then not exist. This
is not a problem specific to the data set model formalism and so it will be
ignored by assuming that all distributions $p \in X$ are such that the
expectation values mentioned below do indeed exist.

Due to the appearance of the exponential functions in this expression, an
obvious choice for the divergence is again that of Kullback and Leibler. This
divergence, with a suitable probability distribution $p$ as its first argument
and a Gumbel distribution as its second argument, takes the form

$$\begin{aligned}
D(p\|p_{\alpha,\mu}) &= -S(p) - \int_X p(x)\ln p_{\alpha,\mu}(x)\,dx \\
&= -S(p) - \int_X p(x)\left[\ln\alpha - \alpha(x - \mu) - e^{-\alpha(x-\mu)}\right]dx, \qquad (5.14)
\end{aligned}$$

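where $S$ represents the Shannon entropy (3.8). In order to construct the
differential geometry induced by this divergence on the manifold of Gumbel
distributions, the derivatives with respect to the parameters are necessary.
They are given by the expressions

$$\begin{aligned}
\partial_\alpha D(p\|p_{\alpha,\mu}) &= -\frac{1}{\alpha} + \mathbb{E}_p[(x - \mu)] - \mathbb{E}_p[(x - \mu)e^{-\alpha(x-\mu)}], \\
\partial_\mu D(p\|p_{\alpha,\mu}) &= -\alpha + \alpha\,\mathbb{E}_p[e^{-\alpha(x-\mu)}].
\end{aligned}$$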

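This means a distribution $p$ is in the fibre of the Gumbel distribution $p_{\alpha,\mu}$ if
and only if it simultaneously satisfies the equations

$$\mathbb{E}_p[e^{-\alpha(x-\mu)}] = 1 \quad\text{and}\quad \mathbb{E}_p[\alpha(x - \mu)(1 - e^{-\alpha(x-\mu)})] = 1. \qquad (5.15)$$

Even though this is not a trivial computation, it can be verified that when
$p = p_{\alpha,\mu}$ these equations are indeed satisfied. The second derivatives of the
divergence (5.14) with respect to the parameters are given by the expressions

$$\begin{aligned}
\partial_\alpha^2 D(p\|p_{\alpha,\mu}) &= \frac{1}{\alpha^2} + \mathbb{E}_p[(x - \mu)^2 e^{-\alpha(x-\mu)}], \\
\partial_\alpha\partial_\mu D(p\|p_{\alpha,\mu}) &= -1 + \mathbb{E}_p[e^{-\alpha(x-\mu)}] - \alpha\,\mathbb{E}_p[(x - \mu)e^{-\alpha(x-\mu)}], \\
\partial_\mu^2 D(p\|p_{\alpha,\mu}) &= \alpha^2\,\mathbb{E}_p[e^{-\alpha(x-\mu)}].
\end{aligned}$$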
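
The claim that the Gumbel distribution lies in its own fibre can be checked by quadrature. The sketch below assumes the density $\alpha\exp\{-\alpha(x-\mu) - e^{-\alpha(x-\mu)}\}$ and evaluates the two conditions (5.15) with the midpoint rule in the variable $u = \alpha(x - \mu)$. The truncation interval and the function names are illustrative choices; outside the interval the integrands are negligibly small.

```python
import math


def gumbel_expectation(f, alpha, mu, n=20000, u_min=-6.0, u_max=45.0):
    """E[f(x)] under the Gumbel density, by the midpoint rule in u = alpha * (x - mu)."""
    step = (u_max - u_min) / n
    total = 0.0
    for j in range(n):
        u = u_min + (j + 0.5) * step
        density_u = math.exp(-u - math.exp(-u))  # density of u, independent of alpha, mu
        total += f(mu + u / alpha) * density_u
    return total * step


def gumbel_fibre_conditions(alpha, mu):
    """Left-hand sides of the two fibre conditions (5.15) for p = p_{alpha,mu}."""
    first = gumbel_expectation(lambda x: math.exp(-alpha * (x - mu)), alpha, mu)
    second = gumbel_expectation(
        lambda x: alpha * (x - mu) * (1.0 - math.exp(-alpha * (x - mu))), alpha, mu)
    return first, second


if __name__ == "__main__":
    print(gumbel_fibre_conditions(alpha=1.3, mu=0.4))  # both values should be close to 1
```

Both expectation values come out equal to one up to quadrature error, independently of the chosen $\alpha$ and $\mu$.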

It is clear that these expressions in general depend on the chosen probability
density $p$. To define a metric, it is sufficient that these expressions become
$p$-independent when $p$ is in the fibre of $p_{\alpha,\mu}$. However, this is not the case,
as is shown by a counterexample. It is only required to show that the first
of the derivatives,

$$\partial_\alpha^2 D(p\|p_{\alpha,\mu}) = \frac{1}{\alpha^2} + \mathbb{E}_p[(x - \mu)^2 e^{-\alpha(x-\mu)}],$$

cannot be expressed without reference to $p$, even when using the conditions
(5.15) imposed by the vanishing of the first derivatives of the divergence.
This shows there exists no metric tensor as defined in the general theoretical
outset of the previous chapter. An obvious choice of distribution to provide
a counterexample, albeit largely for computational convenience, is
the family of exponential distributions with density functions

$$p_\lambda(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0, \\ 0 & x < 0. \end{cases}$$



Fixing a value of $\lambda$, it is possible to determine which fibre of the manifold of
Gumbel distributions contains $p_\lambda$. This computation comprises most of the
work needed in demonstrating the counterexample.

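The equations (5.15) impose relations between the parameters $\alpha$ and $\mu$ on the
one hand and $\lambda$ on the other; in particular these are

$$1 = \frac{\lambda e^{\alpha\mu}}{\lambda + \alpha} \qquad (5.16)$$

and

$$\begin{aligned}
1 &= \alpha\lambda\int_0^{\infty} (x - \mu)\left(1 - e^{-\alpha(x-\mu)}\right)e^{-\lambda x}\,dx \\
&= \alpha\lambda\int_0^{\infty} (x - \mu)e^{-\lambda x}\,dx - \alpha\lambda e^{\alpha\mu}\int_0^{\infty} (x - \mu)e^{-(\lambda+\alpha)x}\,dx \\
&= \frac{\alpha}{\lambda} - \alpha\mu - \frac{\alpha\lambda e^{\alpha\mu}}{(\alpha + \lambda)^2} + \frac{\alpha\lambda\mu e^{\alpha\mu}}{\alpha + \lambda}.
\end{aligned}$$

This relation can be simplified considerably by using (5.16) to yield

$$1 = \frac{\alpha}{\lambda} - \alpha\mu - \frac{\alpha}{\alpha + \lambda} + \alpha\mu = \frac{\alpha^2}{\lambda(\alpha + \lambda)}. \qquad (5.17)$$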

Essentially a quadratic equation in $\alpha$, this condition can be solved for its
positive root

$$\alpha = \frac{\lambda + \sqrt{\lambda^2 + 4\lambda^2}}{2} = \frac{1 + \sqrt{5}}{2}\,\lambda.$$

Negative $\alpha$-values such as the one obtained by choosing the minus sign are
excluded by definition of the Gumbel distributions and so this solution does
not need to be given any attention. By substituting this result in relation
(5.16), an expression for $\mu$ in terms of $\lambda$ can be obtained as

$$\mu = \frac{1}{\alpha}\ln\left(\frac{\lambda + \alpha}{\lambda}\right) = \frac{1}{\lambda}\,\frac{2}{1 + \sqrt{5}}\,\ln\left(\frac{3 + \sqrt{5}}{2}\right).$$
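
The fibre assignment of the exponential distribution can be verified directly. This sketch assumes the closed-form integrals for the two conditions (5.15) evaluated on $p_\lambda$, together with the solution $\alpha = \tfrac{1+\sqrt{5}}{2}\lambda$ and $\mu = \alpha^{-1}\ln((\lambda+\alpha)/\lambda)$. The function names are illustrative.

```python
import math


def gumbel_parameters_for_exponential(lam):
    """Parameters (alpha, mu) of the Gumbel distribution whose fibre contains p_lambda."""
    alpha = 0.5 * (1.0 + math.sqrt(5.0)) * lam
    mu = math.log((lam + alpha) / lam) / alpha
    return alpha, mu


def fibre_conditions(lam, alpha, mu):
    """Left-hand sides of the conditions (5.15) for p_lambda, from the closed-form integrals."""
    scale = lam * math.exp(alpha * mu)
    first = scale / (lam + alpha)
    second = (alpha / lam - alpha * mu
              - alpha * scale / (alpha + lam) ** 2
              + alpha * mu * scale / (alpha + lam))
    return first, second


if __name__ == "__main__":
    for lam in (0.5, 1.0, 3.0):
        alpha, mu = gumbel_parameters_for_exponential(lam)
        print(lam, alpha, mu, fibre_conditions(lam, alpha, mu))
```

For every positive $\lambda$ both conditions evaluate to one, so each exponential distribution lies in exactly one Gumbel fibre.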


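Hence the values of the parameters $\alpha(\lambda)$ and $\mu(\lambda)$ of a Gumbel distribution
upon which $p_\lambda$ is mapped are known. It is now possible to show that the
second derivative of the divergence (5.14) with respect to $\alpha$ is not constant
on the fibres. This derivative contains the term

$$\begin{aligned}
\mathbb{E}_{p_\lambda}[(x - \mu)^2 e^{-\alpha(x-\mu)}] &= \int_{-\infty}^{+\infty} p_\lambda(x)(x - \mu)^2 e^{-\alpha(x-\mu)}\,dx \\
&= \lambda e^{\alpha\mu}\int_0^{\infty} (x - \mu)^2 e^{-(\lambda+\alpha)x}\,dx \\
&= \lambda e^{\alpha\mu}\left[\frac{2}{(\lambda + \alpha)^3} - \frac{2\mu}{(\lambda + \alpha)^2} + \frac{\mu^2}{\lambda + \alpha}\right] \\
&\approx \frac{0.5}{\alpha^2} \quad \text{(using } \alpha(\lambda) \text{ and } \mu(\lambda)\text{)}.
\end{aligned}$$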

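Since the Gumbel distribution $p_{\alpha,\mu}$ is an element of its own fibre, it must
yield this same result when computing this expectation value if the metric
tensor is to exist. A direct computation shows that

$$\begin{aligned}
\mathbb{E}_{p_{\alpha,\mu}}[(x - \mu)^2 e^{-\alpha(x-\mu)}] &= \alpha\int_{-\infty}^{+\infty} (x - \mu)^2 \exp\{-2\alpha(x - \mu) - e^{-\alpha(x-\mu)}\}\,dx \\
&= \frac{1}{\alpha^2}\int_{-\infty}^{+\infty} u^2 e^{-2u - e^{-u}}\,du \\
&= \frac{1}{\alpha^2}\int_0^{+\infty} (\ln t)^2\, t\, e^{-t}\,dt \\
&= \frac{1}{\alpha^2}\left(\gamma^2 - 2\gamma + \frac{\pi^2}{6}\right) \approx \frac{0.82}{\alpha^2}.
\end{aligned}$$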
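
The two competing values can be reproduced in a few lines. The sketch assumes the closed forms derived above: the bracketed expression for the exponential distribution evaluated at $\alpha(\lambda)$ and $\mu(\lambda)$, and $\gamma^2 - 2\gamma + \pi^2/6$ for the Gumbel distribution. Both are multiplied by $\alpha^2$ so that they can be compared directly; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def exponential_value(lam):
    """alpha^2 * E_{p_lambda}[(x - mu)^2 exp(-alpha (x - mu))] at alpha(lambda), mu(lambda)."""
    alpha = 0.5 * (1.0 + math.sqrt(5.0)) * lam
    mu = math.log((lam + alpha) / lam) / alpha
    s = lam + alpha
    bracket = 2.0 / s ** 3 - 2.0 * mu / s ** 2 + mu ** 2 / s
    return alpha ** 2 * lam * math.exp(alpha * mu) * bracket


def gumbel_value():
    """alpha^2 * E_{p_{alpha,mu}}[(x - mu)^2 exp(-alpha (x - mu))]."""
    return EULER_GAMMA ** 2 - 2.0 * EULER_GAMMA + math.pi ** 2 / 6.0


if __name__ == "__main__":
    print(exponential_value(1.0), exponential_value(3.7))  # same value for every lambda
    print(gumbel_value())
```

The exponential value is roughly 0.50 for every $\lambda$, whereas the Gumbel value is roughly 0.82, so the two members of the same fibre disagree.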


It is concluded that, for this data set model, the suggested expression for
the metric is not constant along fibres. This does not mean the manifold
cannot be endowed with a metric tensor at all. For instance it remains possible
to use as a metric tensor the matrix of second derivatives of the divergence
restricted to $M \times M$, evaluated on the diagonal. Using the equations (5.15)
to simplify the expressions, this would yield

$$g(\alpha,\mu) = \begin{pmatrix} \dfrac{1.82}{\alpha^2} & 1 - \gamma \\ 1 - \gamma & \alpha^2 \end{pmatrix}.$$
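
The entries of this matrix can be estimated by quadrature. The sketch below assumes the three second derivatives of the divergence given earlier, evaluated with $p = p_{\alpha,\mu}$, and integrates in the variable $u = \alpha(x - \mu)$ with the midpoint rule. Under these assumptions the off-diagonal entry evaluates to $1 - \gamma$. The integration limits and names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_moment(g, n=20000, u_min=-6.0, u_max=45.0):
    """Integral of g(u) * exp(-u - exp(-u)) du by the midpoint rule, u = alpha * (x - mu)."""
    step = (u_max - u_min) / n
    total = 0.0
    for j in range(n):
        u = u_min + (j + 0.5) * step
        total += g(u) * math.exp(-u - math.exp(-u))
    return total * step


def diagonal_metric(alpha):
    """Second derivatives of the divergence evaluated at p = p_{alpha,mu}."""
    m0 = gumbel_moment(lambda u: math.exp(-u))          # E[e^{-u}]
    m1 = gumbel_moment(lambda u: u * math.exp(-u))      # E[u e^{-u}]
    m2 = gumbel_moment(lambda u: u * u * math.exp(-u))  # E[u^2 e^{-u}]
    g_aa = (1.0 + m2) / alpha ** 2
    g_am = -1.0 + m0 - m1
    g_mm = alpha ** 2 * m0
    return g_aa, g_am, g_mm


if __name__ == "__main__":
    alpha = 2.0
    print(diagonal_metric(alpha))
    print(1.82 / alpha ** 2, 1.0 - EULER_GAMMA, alpha ** 2)  # values quoted in the text
```

The numerical entries agree with the matrix above to the two decimals quoted for the first component.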

This metric tensor only contains information about the manifold of Gumbel
distributions and, as shown above, is not compatible with the structure of
the fibres of the full data set model.

It is instructive to see if a connection could exist for this model. As the metric
is not well-defined, there is no Hessian structure. The spherical submanifold
of the von Mises-Fisher distribution shows, however, that such a structure does
not need to be present in order to define an affine connection.

Once again, the first step in looking for a connection is to single out curves
through the set $X$ of distributions. The derivatives of the divergence are,
up to additive constants, proportional to the expectation values

$$\mathbb{E}_p[e^{-\alpha(x-\mu)}] \quad\text{and}\quad \mathbb{E}_p[(x - \mu)(1 - e^{-\alpha(x-\mu)})]$$

and thus it is convenient to take these values as the parameters for the curves
appearing in the strong version of Condition 6. Indeed, differentiating the
derivatives of the divergence with respect to these parameters would yield
the desired Kronecker delta as a result. However, when trying to express
the second derivatives of the divergence as a function of these parameters, in
order to perform the necessary differentiation, a problem arises. Again it is
the second derivative with respect to $\alpha$ which causes difficulties. As is stated
above, this quantity is given by the expression

$$\partial_\alpha^2 D(p\|p_{\alpha,\mu}) = \frac{1}{\alpha^2} + \mathbb{E}_p[(x - \mu)^2 e^{-\alpha(x-\mu)}].$$

Since this quantity is not constant within fibres, different choices of curves
satisfying all required conditions remain possible, and these yield
different results for the connection coefficients. Just as is the case with the
metric, a certain choice of connection could be made but there exists no
choice compatible with the fibre structure.




6. CONCLUSION AND OUTLOOK 


6.1 Conclusion 

This doctoral research seeks to contribute to the formulation of an abstract
theory of information that is not based on probability. Instead, the
mathematical foundation is provided by differential geometry. The resulting
framework is called the data set model formalism.

Pursuing information theory without probability may seem counter-intuitive
to many readers, in no small part due to the way this discipline is
traditionally treated in textbooks. It is nevertheless an idea which has been advocated
by a number of authors in the past. Recent interest in this question is
motivated by numerous attempts in the literature to base quantum theory on
informational principles. This endeavour may be facilitated by the
availability of a sufficiently general and abstract perspective on information theory.
Advances in experimental quantum physics could also in time vindicate the
development of a formalism such as the one presented in this dissertation.
These new results, enabled by the recently mastered ability to perform weak
measurements, may in time argue in favour of, or even demand, the
formulation of a novel description of quantum information theory. In that case,
having at hand a mathematical foundation such as the one provided by the
data set model formalism could be beneficial to the research community
involved.

The inspiration for this approach is found in information geometry, a field
concerned with the description of probability theory and statistical models
through differential geometry. The framework developed here is a proper
generalisation of information geometry in the sense that the latter can be
re-derived as a special case of the former. The most obvious way in which
the abstraction is exercised is by dropping the demand that the data and
the models are probability distributions. A related and in all likelihood more
important feature is that the models need not be a subset of the data and
may even be qualitatively different mathematical objects. Despite these
differences, the construction of a geometric structure strongly reminiscent of
information geometry is performed in this dissertation.

The geometric structure of the data set model formalism is derived from a
generalised divergence function which quantifies how well a data set is
described by a given model point. More in particular, a Riemannian metric and
an affine connection can be constructed under suitable conditions. The
metric is a generalisation of the Fisher information metric and can be employed
to express the sensitivity of the inferred parameters to measurable functions
of the data. Consequently, when the data set model under consideration is
an exponential family of probability distributions, the Cauchy-Schwarz
inequality for the metric tensor reduces to the well-known Cramér-Rao bound.
The affine connection is flat and torsionless and it is a generalisation of the
exponential connection taking up a prominent role in information geometry.
Of central importance to the data set model geometry is a Hessian structure,
where the metric can be written as the Hessian of a generalised Massieu
function. The point of view that the formalism is a proper generalisation of the
existing literature is further reinforced by discussing a Pythagorean theorem,
which provides the link required to derive the geometry of divergence
functions, as it has been developed by other researchers, from the data set model
geometry.

The theoretical discussion is concluded by establishing a straightforward 
technique to determine whether or not a statistical model belongs to the 
exponential family and to determine the canonical parameters. While these 
notions are not expected to be original, they can be quite useful for someone 
interested in exponential families and the new formalism allows them to be 
applied in an elegant fashion. 

The last chapter of this dissertation is devoted to working out a number of
examples. Each of these illustrates one or more prominent aspects of the
data set model formalism touched upon in the preceding paragraphs. The
examples vary from familiar probability theory and information geometry to
a linear regression method. An example of particular interest to physicists is
found third in that chapter. It concerns systems of non-interacting bosonic
particles and it allows, given experimentally observed occupation numbers
of the energy levels of that system, the grand canonical distribution
function to be found which best describes the state of the system (at the time of
measurement). Those readers more attracted to differential or information
geometry may have their interest sparked in particular by the two examples
in the fourth section. Two statistical models, both submanifolds of the von
Mises-Fisher distributions, are studied there and have their data set model
geometry constructed. The metric tensor and the affine connection obtained
in this way are exactly the ones that would be expected if the submanifolds
had been embedded in Euclidean space. This is a remarkable result since the
containing space is not endowed with a Euclidean metric.

6.2 Outlook

A number of interesting questions regarding data set models remain open.
One question is whether a natural construction for a one-parameter family
of affine connections exists for data set models, as is the case in
information geometry. A better understanding of the remarkable results of the von
Mises-Fisher examples may yield insights into properties of the geometry of
data set models. Should these results hold in general, they could serve to
detect statistical models which are submanifolds of exponential families. It
may also enable a further expansion of the data set model formalism in order
to treat models with a connection which exhibits curvature as well.

From the perspective of applications, the data set model approach may
become a fruitful technique in quantum information theory. Recent, still
unpublished, work shows that the work of Petz on positive-operator valued
measures (see for instance US]) is also encompassed by the formalism. It even
offers a ground for the belief that a further extension and a mathematical
simplification of that research may be possible. Whereas I personally hold
the opinion that the preceding suggestion is the application which looks the
most promising, at least at the time of writing, the great flexibility offered
by the data set model opens up the possibility for plenty of applications.
As the examples of linear regression and of the non-interacting bosons show,
the data set model formalism is very suitable to function as a mathematical
framework for very general fitting procedures. For this reason it could also be
useful to researchers in machine learning. Their discipline is concerned with
a highly varied collection of different types of what are essentially modelling
problems. In particular, a significant part of their field makes use of
probability theory whereas an equally important part does not. This community
is thus familiar with many powerful techniques which are employed in the
study of the former type, which could be extended to the latter type using
the data set model formalism as a unifying framework.